
Avoid deadlock in syncer.Close #92

Closed
peterjan wants to merge 2 commits from pj/syncer-shutdown

Conversation

Member

@peterjan peterjan commented Aug 29, 2024

I ran into the following deadlock:

goroutine 1136792 [semacquire, 56 minutes]:
sync.runtime_Semacquire(0xc000e7b9b8?)
	C:/hostedtoolcache/windows/go/1.22.6/x64/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc000e7b9b0)
	C:/hostedtoolcache/windows/go/1.22.6/x64/src/sync/waitgroup.go:116 +0xa8
go.sia.tech/coreutils/syncer.(*Syncer).Close(0xc000e7b8c0)
	C:/Users/runneradmin/go/pkg/mod/go.sia.tech/[email protected]/syncer/syncer.go:681 +0x59
go.sia.tech/renterd/internal/test/e2e.(*Host).Close(0xc0001114a0)
	D:/a/renterd/renterd/internal/test/e2e/host.go:162 +0x1f5
go.sia.tech/renterd/internal/test/e2e.(*TestCluster).Shutdown(0xc0009addc0)
	D:/a/renterd/renterd/internal/test/e2e/cluster.go:911 +0x105
go.sia.tech/renterd/internal/test/e2e.TestSlabBufferStats(0xc000017520)
	D:/a/renterd/renterd/internal/test/e2e/cluster_test.go:2029 +0x1987

For context, we changed the behaviour here a couple of times:

I think we should rename Close to Shutdown and accept a context if we intend to wait for all goroutines to exit. That still leaves the fact that we deadlocked, though. Looking at all the spots, the only potential culprit I could identify was runPeer, where we don't set a deadline on handleRPC. I considered setting a 5-minute timeout but landed on 10 minutes to be generous.
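To illustrate the Close-to-Shutdown idea, here is a minimal sketch of what the signature change could look like. The `Syncer` stub and its `l` listener field are stand-ins I'm assuming for illustration; only the `wg` wait group mirrors the code touched by this PR:

```go
package syncer

import (
	"context"
	"errors"
	"net"
	"sync"
)

// Syncer is an illustrative stand-in: the real type carries far more
// state, but a listener plus a wait group is enough to show the shape
// of a context-aware Shutdown.
type Syncer struct {
	l  net.Listener   // assumed field name for the peer listener
	wg sync.WaitGroup // tracks runPeer goroutines, as in the diff below
}

// Shutdown closes the listener so no new peers are accepted, then waits
// for peer goroutines to exit, but gives up once ctx is cancelled
// instead of blocking forever.
func (s *Syncer) Shutdown(ctx context.Context) error {
	err := s.l.Close()

	done := make(chan struct{})
	go func() {
		s.wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		return err
	case <-ctx.Done():
		return errors.Join(err, ctx.Err())
	}
}
```

The point of the signature change is that the caller decides how long to wait, rather than the syncer blocking indefinitely on its wait group.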

@peterjan peterjan self-assigned this Aug 29, 2024
@peterjan peterjan marked this pull request as ready for review August 29, 2024 11:30
@peterjan peterjan requested review from n8maninger and ChrisSchinnerl and removed request for n8maninger August 29, 2024 11:30
@peterjan
Member Author

not assigning Luke because he's OOO

@@ -285,14 +285,19 @@ func (s *Syncer) runPeer(p *Peer) error {
			p.setErr(err)
			return fmt.Errorf("failed to accept rpc: %w", err)
		}

		// set a generous deadline
		err = stream.SetDeadline(time.Now().Add(10 * time.Minute))
Member

This is not really a fix, I'm afraid. If we ever run into this on shutdown, the deadlock still exists. We should figure out what is causing the actual deadlock, since the stream should unblock after the listener is closed.

Member Author

Other than this line, I couldn't really identify another spot where it might potentially deadlock. @lukechampine, do you mind taking a look? I commented on the PR with a link to the stack trace. I figured the stream could live on even after the listener got closed.

	s.wg.Wait()

	waitChan := make(chan struct{})
	go func() {
Member

Similar thing here. This shouldn't really be necessary. If we run into this, we still have deadlock issues.

Member Author

I definitely agree that the deadlock is still an issue; I wasn't aware the stream would automatically unblock once the listener got closed. That aside, though, I always use the waitChan pattern to avoid a naked s.wg.Wait and consider it good practice?
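For reference, the waitChan pattern in isolation looks roughly like this; the function name, timeout parameter, and error message are illustrative, not taken from this PR:

```go
package syncer

import (
	"errors"
	"sync"
	"time"
)

// waitWithTimeout shows the waitChan pattern in isolation: wrapping
// wg.Wait in a goroutine lets the caller select against a timeout (or
// a context) instead of blocking unconditionally. The name, signature,
// and error message here are illustrative only.
func waitWithTimeout(wg *sync.WaitGroup, d time.Duration) error {
	waitChan := make(chan struct{})
	go func() {
		wg.Wait()
		close(waitChan)
	}()

	select {
	case <-waitChan:
		return nil
	case <-time.After(d):
		return errors.New("timed out waiting for goroutines to exit")
	}
}
```

The goroutine wrapping wg.Wait is what makes the select possible; without it, the caller has no way to bail out once Wait blocks.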

@peterjan
Member Author

Full stack trace

@peterjan
Member Author

@lukechampine I closed this out for now because I can't reproduce the deadlock on CI and have never encountered it locally. I used to see it every other day in the wild, but I haven't seen it for some time now. Perhaps we should switch strategies and turn this PR into the small Close-to-Shutdown refactor, potentially adding some hardening or slightly rewriting the concurrency mechanisms in the syncer. I asked Chris to keep an eye out; if we do encounter it again, we'll have more information to go on. I really did see it at least four times, but I just haven't seen it since.

@peterjan
Member Author

peterjan commented Sep 26, 2024

Of course, it happened again two days after closing this:

https://github.com/SiaFoundation/renterd/actions/runs/11030847304/job/30636343643?pr=1574

stacktrace.txt

@peterjan peterjan deleted the pj/syncer-shutdown branch October 16, 2024 07:26