Make gossip map more robust #8566
Merged
rustyrussell
merged 14 commits into
ElementsProject:master
from
rustyrussell:guilt/gossip-map-more-robust
Oct 1, 2025
97f804d
to
1990e37
Compare
Still crashing for me:
1990e37
to
58aabf0
Compare
No crash with the latest commits. I've tried wiping |
Signed-off-by: Rusty Russell <[email protected]>
This can happen with other subdaemons too, on ZFS on Linux: ``` 2025-09-24T13:51:22.703Z **BROKEN** connectd: Bad checksum on gossmap record @9850670/9851114 should be 3379961343 (01009411e26cd56d68aabc285ee1c8ee43d59be6f939b0ce353d80213918680a7438356b9c5ea6bb001a6bb37a4dea93776f4abc8cd371525b4d1605a74b89d7cb1bfc8865ddf22288c7ea08b9d98b34155b4aed159eb81732957e6bf79b996752bf2a9995aaead1d65e7889e826ea0ba42f7746c176fe12f2fe6c04af1a74b4f0a262d20efd57133eb32693c789eb3f09caf4f4c6ecd2f734b3b36e751ffcc2748c58feabce4173c4ce6098a2c5397aabf1be5442cb67b5030be11ebd8b9841838dae127fe30000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 ``` Reported-by: @grubles Signed-off-by: Rusty Russell <[email protected]>
We might not have read the final entry. Signed-off-by: Rusty Russell <[email protected]>
It only gets called for diagnostics when something goes wrong (and we were going to exit anyway), and it's only useful with mmap (which we now disable on error) but it shouldn't crash:

```
**BROKEN** gossipd: Truncated gossmap record @7991501/7991523 (len 0): waiting
**BROKEN** gossipd: FATAL SIGNAL 6 (version v25.09)
**BROKEN** gossipd: backtrace: common/daemon.c:41 (send_backtrace) 0x6506817cc529
**BROKEN** gossipd: backtrace: common/daemon.c:78 (crashdump) 0x6506817cc578
**BROKEN** gossipd: backtrace: ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0 ((null)) 0x75e8267a032f
**BROKEN** gossipd: backtrace: ./nptl/pthread_kill.c:44 (__pthread_kill_implementation) 0x75e8267f9b2c
**BROKEN** gossipd: backtrace: ./nptl/pthread_kill.c:78 (__pthread_kill_internal) 0x75e8267f9b2c
**BROKEN** gossipd: backtrace: ./nptl/pthread_kill.c:89 (__GI___pthread_kill) 0x75e8267f9b2c
**BROKEN** gossipd: backtrace: ../sysdeps/posix/raise.c:26 (__GI_raise) 0x75e8267a027d
**BROKEN** gossipd: backtrace: ./stdlib/abort.c:79 (__GI_abort) 0x75e8267838fe
**BROKEN** gossipd: backtrace: ./assert/assert.c:96 (__assert_fail_base) 0x75e82678381a
**BROKEN** gossipd: backtrace: ./assert/assert.c:105 (__assert_fail) 0x75e826796516
**BROKEN** gossipd: backtrace: common/gossmap.c:111 (map_copy) 0x6506817cea77
**BROKEN** gossipd: backtrace: common/gossmap.c:1870 (gossmap_fetch_tail) 0x6506817d1f93
**BROKEN** gossipd: backtrace: gossipd/gossmap_manage.c:1442 (gossmap_manage_get_gossmap) 0x6506817c45fb
**BROKEN** gossipd: backtrace: gossipd/gossmap_manage.c:753 (gossmap_manage_handle_get_txout_reply) 0x6506817c5850
**BROKEN** gossipd: backtrace: gossipd/gossipd.c:574 (recv_req) 0x6506817c172b
```

Reported-by: @grubles
Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
This should detect partial writes more robustly, since we make a separate pwrite() call to update this flag after the record is written. Previously we were playing a bit loose with synchronization assumptions, which seemed to work on Linux ext4, but not so well elsewhere. Signed-off-by: Rusty Russell <[email protected]>
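The idea in the commit message above (write the record first, then set a "completed" flag with a separate pwrite() call) can be sketched roughly as follows. This is a minimal illustration, not the gossip_store code: the three-byte header layout, `REC_COMPLETED`, `pwrite_all()`, `append_record()` and `record_is_complete()` are hypothetical names and formats invented for this example.

```c
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record layout, for illustration only:
 * 1 flag byte, 2-byte big-endian payload length, then the payload. */
#define REC_HDR_LEN 3
#define REC_COMPLETED 0x01

/* Write exactly len bytes at off, looping over short writes.
 * Returns 0 on success, -1 on error (EINTR is not retried here). */
static int pwrite_all(int fd, const void *buf, size_t len, off_t off)
{
	const uint8_t *p = buf;

	while (len > 0) {
		ssize_t n = pwrite(fd, p, len, off);
		if (n <= 0)
			return -1;
		p += n;
		len -= (size_t)n;
		off += n;
	}
	return 0;
}

/* Write a record at off with the completed flag clear, then set the
 * flag with its own pwrite() once header and payload are written. */
static int append_record(int fd, off_t off, const void *payload, uint16_t len)
{
	uint8_t hdr[REC_HDR_LEN] = { 0, (uint8_t)(len >> 8), (uint8_t)(len & 0xff) };
	uint8_t done = REC_COMPLETED;

	if (pwrite_all(fd, hdr, sizeof(hdr), off) != 0)
		return -1;
	if (pwrite_all(fd, payload, len, off + REC_HDR_LEN) != 0)
		return -1;
	/* Separate call: a reader that sees the flag set can assume the
	 * preceding writes of this record were at least issued in full. */
	return pwrite_all(fd, &done, 1, off);
}

/* Reader side: a record whose flag is unset (or unreadable) is
 * treated as a partial write and not parsed. */
static int record_is_complete(int fd, off_t off)
{
	uint8_t flag;

	if (pread(fd, &flag, 1, off) != 1)
		return 0;
	return (flag & REC_COMPLETED) != 0;
}
```

A reader using this scheme would check `record_is_complete()` before trusting the length or payload, and would treat an unset flag at the tail of the file as "not written yet" rather than as corruption.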
2bcd99f
to
cf8feea
Compare
It was still using private channel announcements, which were removed in v13.
54cc00f
to
9441853
Compare
Signed-off-by: Rusty Russell <[email protected]>
…D_BIT set. Mostly this meant running them, then running devtools/convert-gossmap and replacing the code. Signed-off-by: Rusty Russell <[email protected]>
…E_COMPLETED_BIT set. Simply ran them through devtools/convert-gossmap, though for gossip_store-part2 it had to be appended to gossip_store-part1, converted, then cut off again. Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
…et a read issue. This is a last resort, but what else are we supposed to do when we wrote something and it didn't appear? In particular, ZFS doesn't just "fix itself": ``` remaining_fd=200001b0c9761dff0000000001009411e26cd56d68aabc285ee1c8ee43d59be6f939b0ce353d80213918680a7438356b9c5ea6bb001a6 bb37a4dea93776f4abc8cd371525b4d1605a74b89d7cb1bfc8865ddf22288c7ea08b9d98b34155b4aed159eb81732957e6bf79b996752bf2a9995aae ad1d65e7889e826ea0ba42f7746c176fe12f2fe6c04af1a74b4f0a262d20efd57133eb32693c789eb3f09caf4f4c6ecd2f734b3b36e751ffcc2748c5 8feabce4173c4ce6098a2c5397aabf1be5442cb67b5030be11ebd8b9841838dae127fe30000000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000000000000000002000000a218b9d93000000001005000000000000c060 ``` Note the record appended on the end *after all the zeroes*. Changelog-Changed: gossipd: add gossip_store recovery for filesystems which do not synchronize read and write (e.g. ZFS on Linux), by disabling mmap reads and rewriting the last records. Signed-off-by: Rusty Russell <[email protected]>
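As a rough illustration of the "read back with pread() instead of mmap, and rewrite if the data did not appear" recovery described in the changelog entry above, here is a minimal sketch. It is not the gossipd implementation: `write_verified()`, its 256-byte limit and the retry count are hypothetical choices made for this example.

```c
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write buf at off, read it back through pread()
 * (not through an mmap view), and rewrite it if the bytes read back do
 * not match what was written.
 * Returns the number of attempts used (1 = first write was visible),
 * or -1 on I/O error, oversized input, or if every attempt mismatched. */
static int write_verified(int fd, off_t off, const void *buf, size_t len,
			  int max_attempts)
{
	uint8_t check[256];

	if (len > sizeof(check))
		return -1;

	for (int i = 0; i < max_attempts; i++) {
		if (pwrite(fd, buf, len, off) != (ssize_t)len)
			return -1;
		if (pread(fd, check, len, off) == (ssize_t)len
		    && memcmp(check, buf, len) == 0)
			return i + 1;
		/* Read-back disagreed with what we wrote: rewrite. */
	}
	return -1;
}
```

On a filesystem where reads and writes through the same descriptor are coherent, this should succeed on the first attempt; the rewrite path only matters in the misbehaving case the commit describes.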
gossipd now uses pwrite(), which is more broadly supported. Signed-off-by: Rusty Russell <[email protected]>
9441853
to
1db1b92
Compare
Fixes: #8542