Make gossip map more robust #8566

rustyrussell · 2025-09-23T02:35:11Z

grubles · 2025-09-24T17:16:40Z

Still crashing for me:

2025-09-24T13:51:22.703Z **BROKEN** connectd: Bad checksum on gossmap record @9850670/9851114 should be 3379961343 (01009411e26cd56d68aabc285ee1c8ee43d59be6f939b0ce353d80213918680a7438356b9c5ea6bb001a6bb37a4dea93776f4abc8cd371525b4d1605a74b89d7cb1bfc8865ddf22288c7ea08b9d98b34155b4aed159eb81732957e6bf79b996752bf2a9995aaead1d65e7889e826ea0ba42f7746c176fe12f2fe6c04af1a74b4f0a262d20efd57133eb32693c789eb3f09caf4f4c6ecd2f734b3b36e751ffcc2748c58feabce4173c4ce6098a2c5397aabf1be5442cb67b5030be11ebd8b9841838dae127fe30000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000): waiting                                                                                          2025-09-24T13:51:22.703Z **BROKEN** connectd: Bad checksum on gossmap record @9850670/9851136 should be 3379961343 (01009411e26cd56d68aabc285ee1c8ee43d59be6f939b0ce353d80213918680a7438356b9c5ea6bb001a6bb37a4dea93776f4abc8cd371525b4d1605a74b89d7cb1bfc8865ddf22288c7ea08b9d98b34155b4aed159eb81732957e6bf79b996752bf2a9995aaead1d65e7889e826ea0ba42f7746c176fe12f2fe6c04af1a74b4f0a262d20efd57133eb32693c789eb3f09caf4f4c6ecd2f734b3b36e751ffcc2748c58feabce4173c4ce6098a2c5397aabf1be5442cb67b5030be11ebd8b9841838dae127fe3000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000): waiting                                                                                          
0x623de0e30611 send_backtrace                                                                                           
        common/daemon.c:33                                                                                              
0x623de0e3b0d2 status_failed                                                                                            
        common/status.c:206                                                                                             
0x623de0e2869e gossmap_manage_get_gossmap                                                                               
        gossipd/gossmap_manage.c:1460
0x623de0e29955 gossmap_manage_handle_get_txout_reply                                                                    
        gossipd/gossmap_manage.c:753                                                                                    
0x623de0e2572b recv_req                                                                                                 
        gossipd/gossipd.c:574
0x623de0e30945 handle_read                                                                                              
        common/daemon_conn.c:35
0x623de0eca8d7 next_plan                                                                                                
        ccan/ccan/io/io.c:60                                                                                            
0x623de0ecada8 do_plan                                                                                                  
        ccan/ccan/io/io.c:422                                                                                           
0x623de0ecae65 io_ready                                                                                                 
        ccan/ccan/io/io.c:439                                                                                           
0x623de0ecc7d7 io_loop                                                                                                  
        ccan/ccan/io/poll.c:455                                                                                         
0x623de0e261cd main
        gossipd/gossipd.c:663                                                                                           
0x7b19c33621c9 __libc_start_call_main                                                                                   
        ../sysdeps/nptl/libc_start_call_main.h:58
0x7b19c336228a __libc_start_main_impl
        ../csu/libc-start.c:360
0x623de0e22ed4 ???
        _start+0x24:0
0xffffffffffffffff ???
        ???:0
2025-09-24T13:51:23.141Z **BROKEN** gossipd: Bad checksum on gossmap record @9850670/9851136 should be 3379961343 (01009
411e26cd56d68aabc285ee1c8ee43d59be6f939b0ce353d80213918680a7438356b9c5ea6bb001a6bb37a4dea93776f4abc8cd371525b4d1605a74b8
9d7cb1bfc8865ddf22288c7ea08b9d98b34155b4aed159eb81732957e6bf79b996752bf2a9995aaead1d65e7889e826ea0ba42f7746c176fe12f2fe6
c04af1a74b4f0a262d20efd57133eb32693c789eb3f09caf4f4c6ecd2f734b3b36e751ffcc2748c58feabce4173c4ce6098a2c5397aabf1be5442cb6
7b5030be11ebd8b9841838dae127fe300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000): waiting
2025-09-24T13:51:23.141Z **BROKEN** gossipd: Gossmap failed to process entire gossip_store, disabling mmap: at 9850670 o
f 9851136 remaining_mmap=00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 re
maining_fd=200001b0c9761dff0000000001009411e26cd56d68aabc285ee1c8ee43d59be6f939b0ce353d80213918680a7438356b9c5ea6bb001a6
bb37a4dea93776f4abc8cd371525b4d1605a74b89d7cb1bfc8865ddf22288c7ea08b9d98b34155b4aed159eb81732957e6bf79b996752bf2a9995aae
ad1d65e7889e826ea0ba42f7746c176fe12f2fe6c04af1a74b4f0a262d20efd57133eb32693c789eb3f09caf4f4c6ecd2f734b3b36e751ffcc2748c5
8feabce4173c4ce6098a2c5397aabf1be5442cb67b5030be11ebd8b9841838dae127fe30000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000002000000a218b9d93000000001005000000000000c060
2025-09-24T13:51:23.141Z **BROKEN** gossipd: Gossmap map_used 9850670 of 9851136 with 9851136 written (version v25.09-70
-g1990e37)
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: common/daemon.c:41 (send_backtrace) 0x623de0e3065e
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: common/status.c:206 (status_failed) 0x623de0e3b0d2
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: gossipd/gossmap_manage.c:1460 (gossmap_manage_get_gossmap) 0x623
de0e2869e
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: gossipd/gossmap_manage.c:753 (gossmap_manage_handle_get_txout_re
ply) 0x623de0e29955
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: gossipd/gossipd.c:574 (recv_req) 0x623de0e2572b
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: common/daemon_conn.c:35 (handle_read) 0x623de0e30945
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: ccan/ccan/io/io.c:60 (next_plan) 0x623de0eca8d7
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: ccan/ccan/io/io.c:422 (do_plan) 0x623de0ecada8
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: ccan/ccan/io/io.c:439 (io_ready) 0x623de0ecae65
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: ccan/ccan/io/poll.c:455 (io_loop) 0x623de0ecc7d7
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: gossipd/gossipd.c:663 (main) 0x623de0e261cd
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: ../sysdeps/nptl/libc_start_call_main.h:58 (__libc_start_call_mai
n) 0x7b19c33621c9
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: ../csu/libc-start.c:360 (__libc_start_main_impl) 0x7b19c336228a
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0x623de0e22ed4
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0xffffffffffffffff
2025-09-24T13:51:23.141Z **BROKEN** gossipd: STATUS_FAIL_INTERNAL_ERROR: Gossmap map_used 9850670 of 9851136 with 985113
6 written
lightningd: gossipd failed (exit status 242), exiting.
Lost connection to the RPC socket.Lost connection to the RPC socket.Lost connection to the RPC socket.Lost connection to
 the RPC socket.Lost connection to the RPC socket.Lost connection to the RPC socket.Lost connection to the RPC socket.Lo
st connection to the RPC socket.Lost connection to the RPC socket.Lost connection to the RPC socket.Lost connection to t

grubles · 2025-09-27T11:42:44Z

No crash with the latest commits. I've tried wiping gossip_store and re-syncing a few times to be sure.

Signed-off-by: Rusty Russell <[email protected]>

This can happen with other subdaemons too, on ZFS on Linux: ``` 2025-09-24T13:51:22.703Z **BROKEN** connectd: Bad checksum on gossmap record @9850670/9851114 should be 3379961343 (01009411e26cd56d68aabc285ee1c8ee43d59be6f939b0ce353d80213918680a7438356b9c5ea6bb001a6bb37a4dea93776f4abc8cd371525b4d1605a74b89d7cb1bfc8865ddf22288c7ea08b9d98b34155b4aed159eb81732957e6bf79b996752bf2a9995aaead1d65e7889e826ea0ba42f7746c176fe12f2fe6c04af1a74b4f0a262d20efd57133eb32693c789eb3f09caf4f4c6ecd2f734b3b36e751ffcc2748c58feabce4173c4ce6098a2c5397aabf1be5442cb67b5030be11ebd8b9841838dae127fe30000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 ``` Reported-by: @grubles Signed-off-by: Rusty Russell <[email protected]>

It only gets called for diagnostics when something goes wrong (and we were going to exit anyway), and it's only useful with mmap (which we now disable on error) but it shouldn't crash:

```
**BROKEN** gossipd: Truncated gossmap record @7991501/7991523 (len 0): waiting
**BROKEN** gossipd: FATAL SIGNAL 6 (version v25.09)
**BROKEN** gossipd: backtrace: common/daemon.c:41 (send_backtrace) 0x6506817cc529
**BROKEN** gossipd: backtrace: common/daemon.c:78 (crashdump) 0x6506817cc578
**BROKEN** gossipd: backtrace: ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0 ((null)) 0x75e8267a032f
**BROKEN** gossipd: backtrace: ./nptl/pthread_kill.c:44 (__pthread_kill_implementation) 0x75e8267f9b2c
**BROKEN** gossipd: backtrace: ./nptl/pthread_kill.c:78 (__pthread_kill_internal) 0x75e8267f9b2c
**BROKEN** gossipd: backtrace: ./nptl/pthread_kill.c:89 (__GI___pthread_kill) 0x75e8267f9b2c
**BROKEN** gossipd: backtrace: ../sysdeps/posix/raise.c:26 (__GI_raise) 0x75e8267a027d
**BROKEN** gossipd: backtrace: ./stdlib/abort.c:79 (__GI_abort) 0x75e8267838fe
**BROKEN** gossipd: backtrace: ./assert/assert.c:96 (__assert_fail_base) 0x75e82678381a
**BROKEN** gossipd: backtrace: ./assert/assert.c:105 (__assert_fail) 0x75e826796516
**BROKEN** gossipd: backtrace: common/gossmap.c:111 (map_copy) 0x6506817cea77
**BROKEN** gossipd: backtrace: common/gossmap.c:1870 (gossmap_fetch_tail) 0x6506817d1f93
**BROKEN** gossipd: backtrace: gossipd/gossmap_manage.c:1442 (gossmap_manage_get_gossmap) 0x6506817c45fb
**BROKEN** gossipd: backtrace: gossipd/gossmap_manage.c:753 (gossmap_manage_handle_get_txout_reply) 0x6506817c5850
**BROKEN** gossipd: backtrace: gossipd/gossipd.c:574 (recv_req) 0x6506817c172b
```

Reported-by: @grubles
Signed-off-by: Rusty Russell <[email protected]>

Signed-off-by: Rusty Russell <[email protected]>

This should detect partial writes more robustly, since we make a separate pwrite() call to update this flag after the record is written. Previously we were playing a bit loose with synchronization assumptions, which seemed to work on Linux ext4, but not so well elsewhere. Signed-off-by: Rusty Russell <[email protected]>
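For readers following along, here is a minimal sketch of the ordering described above: write the record with its flag clear, then set the flag with a separate pwrite(). The record layout, `REC_COMPLETED`, `struct rec_hdr`, `append_record()`, `record_is_complete()` and `scratch_fd()` are all hypothetical names made up for illustration; this is not the real gossip_store format or the code in this PR.

```c
#define _XOPEN_SOURCE 700	/* pread(), pwrite(), mkstemp() */
#include <arpa/inet.h>		/* htons(), ntohs() */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical on-disk layout, for illustration only: this is NOT the
 * real gossip_store record format. */
#define REC_COMPLETED 0x0001	/* hypothetical "fully written" flag */

struct rec_hdr {
	uint16_t flags;	/* big-endian; REC_COMPLETED is set last */
	uint16_t len;	/* big-endian payload length */
};

/* Append one record at `off`.  Returns the offset just past it, or -1. */
static off_t append_record(int fd, off_t off, const void *payload, uint16_t len)
{
	struct rec_hdr hdr = { htons(0), htons(len) };
	uint16_t done = htons(REC_COMPLETED);

	assert(payload != NULL || len == 0);

	/* Step 1: header (flag clear), then the payload. */
	if (pwrite(fd, &hdr, sizeof(hdr), off) != (ssize_t)sizeof(hdr))
		return -1;
	if (pwrite(fd, payload, len, off + (off_t)sizeof(hdr)) != (ssize_t)len)
		return -1;

	/* Step 2: a separate pwrite() sets the flag.  The intent is that a
	 * reader which sees the flag can assume the body was written first. */
	if (pwrite(fd, &done, sizeof(done),
		   off + (off_t)offsetof(struct rec_hdr, flags))
	    != (ssize_t)sizeof(done))
		return -1;
	return off + (off_t)sizeof(hdr) + len;
}

/* 1 if the record at `off` is marked complete, 0 if not (yet), -1 on error. */
static int record_is_complete(int fd, off_t off)
{
	uint16_t flags;
	ssize_t r = pread(fd, &flags, sizeof(flags),
			  off + (off_t)offsetof(struct rec_hdr, flags));

	if (r < 0)
		return -1;
	if (r != (ssize_t)sizeof(flags))
		return 0;	/* header not there yet: treat as incomplete */
	return (ntohs(flags) & REC_COMPLETED) != 0;
}

/* Scratch file for trying this out: created, then unlinked immediately. */
static int scratch_fd(void)
{
	char path[] = "/tmp/store-sketch-XXXXXX";
	int fd = mkstemp(path);

	if (fd >= 0)
		unlink(path);
	return fd;
}
```

A reader would poll `record_is_complete()` at the current end offset and only parse the record once it returns 1; anything else is treated as "still being written" rather than as corruption.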

It was still using private channel announcements, which were removed in v13.

Signed-off-by: Rusty Russell <[email protected]>

…E_COMPLETED_BIT set. Simply ran them through devtools/convert-gossmap, though for gossip_store-part2 it had to be appended to gossip_store-part1, converted, then cut off again. Signed-off-by: Rusty Russell <[email protected]>

Signed-off-by: Rusty Russell <[email protected]>

…et a read issue. This is a last resort, but what else are we supposed to do when we wrote something and it didn't appear? In particular, ZFS doesn't just "fix itself":

```
remaining_fd=200001b0c9761dff0000000001009411e26cd56d68aabc285ee1c8ee43d59be6f939b0ce353d80213918680a7438356b9c5ea6bb001a6
bb37a4dea93776f4abc8cd371525b4d1605a74b89d7cb1bfc8865ddf22288c7ea08b9d98b34155b4aed159eb81732957e6bf79b996752bf2a9995aae
ad1d65e7889e826ea0ba42f7746c176fe12f2fe6c04af1a74b4f0a262d20efd57133eb32693c789eb3f09caf4f4c6ecd2f734b3b36e751ffcc2748c5
8feabce4173c4ce6098a2c5397aabf1be5442cb67b5030be11ebd8b9841838dae127fe30000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000002000000a218b9d93000000001005000000000000c060
```

Note the record appended on the end *after all the zeroes*.

Changelog-Changed: gossipd: add gossip_store recovery for filesystems which do not synchronize read and write (e.g. ZFS on Linux), by disabling mmap reads and rewriting the last records.
Signed-off-by: Rusty Russell <[email protected]>
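As a rough illustration of the "disabling mmap reads" half of that recovery, the sketch below reads through an mmap while one is available and falls back to pread() once it has been dropped. `struct store_view`, `view_open()`, `view_disable_mmap()`, `view_copy()` and `scratch_fd()` are hypothetical names chosen for this example; it does not mirror the real `gossmap_disable_mmap()` or `map_copy()` code, and it does not attempt the "rewriting the last records" step.

```c
#define _XOPEN_SOURCE 700	/* pread(), mkstemp() */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical names throughout; not the actual common/gossmap.c code. */
struct store_view {
	int fd;
	const uint8_t *map;	/* NULL when mmap is unavailable or disabled */
	size_t map_size;
};

/* Try to map the whole file read-only; silently fall back to fd reads. */
static void view_open(struct store_view *v, int fd)
{
	struct stat st;
	void *p;

	v->fd = fd;
	v->map = NULL;
	v->map_size = 0;
	if (fstat(fd, &st) != 0 || st.st_size == 0)
		return;
	p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (p != MAP_FAILED) {
		v->map = p;
		v->map_size = (size_t)st.st_size;
	}
}

/* Drop the mapping so every later read goes through pread(). */
static void view_disable_mmap(struct store_view *v)
{
	if (v->map)
		munmap((void *)v->map, v->map_size);
	v->map = NULL;
	v->map_size = 0;
}

/* Copy `len` bytes at `off` into `dst`.  0 on success, -1 if they are
 * not all available (out of the mapped range, short file, or I/O error). */
static int view_copy(const struct store_view *v, void *dst, off_t off, size_t len)
{
	uint8_t *out = dst;

	assert(v != NULL && dst != NULL);
	if (off < 0)
		return -1;
	if (v->map) {
		if ((size_t)off > v->map_size || len > v->map_size - (size_t)off)
			return -1;
		memcpy(dst, v->map + off, len);
		return 0;
	}
	while (len > 0) {
		ssize_t r = pread(v->fd, out, len, off);

		if (r <= 0)
			return -1;	/* error, or EOF before `len` bytes */
		out += r;
		off += r;
		len -= (size_t)r;
	}
	return 0;
}

/* Scratch file for trying this out: created, then unlinked immediately. */
static int scratch_fd(void)
{
	char path[] = "/tmp/view-sketch-XXXXXX";
	int fd = mkstemp(path);

	if (fd >= 0)
		unlink(path);
	return fd;
}
```

The point of routing everything through one copy helper is that callers do not care which path served the bytes, so switching to fd reads after a bad checksum is a one-line change at the call site.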

rustyrussell added this to the v25.12 milestone Sep 23, 2025

rustyrussell requested a review from cdecker as a code owner September 23, 2025 02:35

rustyrussell force-pushed the guilt/gossip-map-more-robust branch from 97f804d to 1990e37 Compare September 23, 2025 02:35

madelinevibes added the 25.09.1 Point release for 25.09 label Sep 25, 2025

rustyrussell force-pushed the guilt/gossip-map-more-robust branch from 1990e37 to 58aabf0 Compare September 25, 2025 05:11

rustyrussell added 6 commits September 29, 2025 00:57

gossmap: routine gossmap_disable_mmap() to force read() calls.

587a0ef

Signed-off-by: Rusty Russell <[email protected]>

gossmap: refresh map even if size hasn't changed.

54d1633

We might not have read the final entry. Signed-off-by: Rusty Russell <[email protected]>

common: remove unused push bit.

93b8fd8

Signed-off-by: Rusty Russell <[email protected]>

rustyrussell force-pushed the guilt/gossip-map-more-robust branch 2 times, most recently from 2bcd99f to cf8feea Compare September 29, 2025 03:34

pyln-client: update ancient gossmap in test_gossmap tests.

714dc17

It was still using private channel announcements, which were removed in v13.

rustyrussell force-pushed the guilt/gossip-map-more-robust branch 2 times, most recently from 54cc00f to 9441853 Compare September 30, 2025 06:59

rustyrussell added 7 commits October 1, 2025 10:48

devtools: create conversion tool for old gossip stores.

2715e26

Signed-off-by: Rusty Russell <[email protected]>

unit tests: update all the gossmaps to have the GOSSIP_STORE_COMPLETE…

f34c9fc

…D_BIT set. Mostly this meant running them, then running devtools/convert-gossmap and replacing the code. Signed-off-by: Rusty Russell <[email protected]>

gossip_store: wait for completed bit on reading.

fa1c5fd

Signed-off-by: Rusty Russell <[email protected]>

gossmap: use gossmap_disable_mmap() on corruption.

1c1c718

Signed-off-by: Rusty Russell <[email protected]>

configure: remove now-unneeded HAVE_PWRITEV.

1db1b92

gossipd now uses pwrite(), which is more broadly supported. Signed-off-by: Rusty Russell <[email protected]>

rustyrussell force-pushed the guilt/gossip-map-more-robust branch from 9441853 to 1db1b92 Compare October 1, 2025 01:23

rustyrussell merged commit 6af7fc6 into ElementsProject:master Oct 1, 2025
35 of 39 checks passed
