
Avoid deadlock in syncer.Close #92

Closed
peterjan wants to merge 2 commits from pj/syncer-shutdown

Conversation

Member

@peterjan peterjan commented Aug 29, 2024

I ran into the following deadlock:

goroutine 1136792 [semacquire, 56 minutes]:
sync.runtime_Semacquire(0xc000e7b9b8?)
	C:/hostedtoolcache/windows/go/1.22.6/x64/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc000e7b9b0)
	C:/hostedtoolcache/windows/go/1.22.6/x64/src/sync/waitgroup.go:116 +0xa8
go.sia.tech/coreutils/syncer.(*Syncer).Close(0xc000e7b8c0)
	C:/Users/runneradmin/go/pkg/mod/go.sia.tech/[email protected]/syncer/syncer.go:681 +0x59
go.sia.tech/renterd/internal/test/e2e.(*Host).Close(0xc0001114a0)
	D:/a/renterd/renterd/internal/test/e2e/host.go:162 +0x1f5
go.sia.tech/renterd/internal/test/e2e.(*TestCluster).Shutdown(0xc0009addc0)
	D:/a/renterd/renterd/internal/test/e2e/cluster.go:911 +0x105
go.sia.tech/renterd/internal/test/e2e.TestSlabBufferStats(0xc000017520)
	D:/a/renterd/renterd/internal/test/e2e/cluster_test.go:2029 +0x1987

For context, we changed the behaviour here a couple of times:

I think we should rename Close to Shutdown and accept a context if we intend to wait for all goroutines to exit. That still leaves the fact that we deadlocked, though. Looking at all the spots, the only potential culprit I could identify was runPeer, where we don't set a deadline on handleRPC. I considered setting a 5-minute timeout but landed on 10 minutes to be generous.
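To illustrate the Close-to-Shutdown idea, here is a minimal sketch of what the signature change could look like. The `Syncer` stub and its `l` listener field are stand-ins I'm assuming for illustration; only the `wg` wait group mirrors the code touched by this PR:

```go
package syncer

import (
	"context"
	"errors"
	"net"
	"sync"
)

// Syncer is an illustrative stand-in: the real type carries far more
// state, but a listener plus a wait group is enough to show the shape
// of a context-aware Shutdown.
type Syncer struct {
	l  net.Listener   // assumed field name for the peer listener
	wg sync.WaitGroup // tracks runPeer goroutines, as in the diff below
}

// Shutdown closes the listener so no new peers are accepted, then waits
// for peer goroutines to exit, but gives up once ctx is cancelled
// instead of blocking forever.
func (s *Syncer) Shutdown(ctx context.Context) error {
	err := s.l.Close()

	done := make(chan struct{})
	go func() {
		s.wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		return err
	case <-ctx.Done():
		return errors.Join(err, ctx.Err())
	}
}
```

The point of the signature change is that the caller decides how long to wait, rather than the syncer blocking indefinitely on its wait group.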

@peterjan peterjan self-assigned this Aug 29, 2024
@peterjan peterjan marked this pull request as ready for review August 29, 2024 11:30
@peterjan peterjan requested review from n8maninger and ChrisSchinnerl and removed request for n8maninger August 29, 2024 11:30
@peterjan
Member Author

not assigning Luke because he's OOO

@@ -285,14 +285,19 @@ func (s *Syncer) runPeer(p *Peer) error {
			p.setErr(err)
			return fmt.Errorf("failed to accept rpc: %w", err)
		}

		// set a generous deadline
		err = stream.SetDeadline(time.Now().Add(10 * time.Minute))
Member

This is not really a fix, I'm afraid. If we ever run into this on shutdown, the deadlock still exists. We should figure out what is causing the actual deadlock, since the stream should unblock after the listener is closed.

Member Author

Other than this line, I couldn't really identify another spot where it might potentially deadlock. @lukechampine, do you mind taking a look? I commented on the PR with a link to the stack trace. I figured the stream could live on even after the listener got closed.

	s.wg.Wait()

	waitChan := make(chan struct{})
	go func() {
Member

Similar thing here. This shouldn't really be necessary. If we run into this, we still have deadlock issues.

Member Author

I definitely agree that the deadlock is still an issue; I wasn't aware the stream would automatically unblock once the listener got closed. That aside, though, I always use the waitChan pattern to avoid a naked s.wg.Wait and consider it good practice?
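For reference, the waitChan pattern in isolation looks roughly like this; the function name, timeout parameter, and error message are illustrative, not taken from this PR:

```go
package syncer

import (
	"errors"
	"sync"
	"time"
)

// waitWithTimeout shows the waitChan pattern in isolation: wrapping
// wg.Wait in a goroutine lets the caller select against a timeout (or
// a context) instead of blocking unconditionally. The name, signature,
// and error message here are illustrative only.
func waitWithTimeout(wg *sync.WaitGroup, d time.Duration) error {
	waitChan := make(chan struct{})
	go func() {
		wg.Wait()
		close(waitChan)
	}()

	select {
	case <-waitChan:
		return nil
	case <-time.After(d):
		return errors.New("timed out waiting for goroutines to exit")
	}
}
```

The goroutine wrapping wg.Wait is what makes the select possible; without it, the caller has no way to bail out once Wait blocks.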

@peterjan
Member Author

Full stack trace

@peterjan
Member Author

@lukechampine I closed this out for now because I can't reproduce the deadlock on CI and have never encountered it locally. I used to see it every other day in the wild, but I haven't seen it for some time now. Perhaps we should switch strategies and turn this PR into the small Close-to-Shutdown refactor, potentially adding some hardening or slightly rewriting the concurrency mechanisms in the syncer. I asked Chris to keep an eye out; if we do encounter it again, we'll have more information to go on. I really did see it at least four times, but I just haven't seen it since.

@peterjan
Member Author

peterjan commented Sep 26, 2024

Of course, it happened again two days after closing this:

https://github.com/SiaFoundation/renterd/actions/runs/11030847304/job/30636343643?pr=1574

stacktrace.txt

@peterjan peterjan deleted the pj/syncer-shutdown branch October 16, 2024 07:26