Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock in Fail-back Group #334

Open
er1cst opened this issue Sep 9, 2023 · 1 comment
Open

Deadlock in Fail-back Group #334

er1cst opened this issue Sep 9, 2023 · 1 comment

Comments

@er1cst
Copy link
Contributor

er1cst commented Sep 9, 2023

In recent days, I found RouteDNS suddenly stopped responding to any requests randomly until I restarted the service.

My configuration is quite simple: a cache and a router in front of a fail-back group switching between several DoH and DoT resolvers:

[listeners.local-tcp]
address = "0.0.0.0:53"
protocol = "tcp"
resolver = "cache"

[listeners.local-udp]
address = "0.0.0.0:53"
protocol = "tcp"
resolver = "cache"

[groups.cache]
type = "cache"
resolvers = ["router1"]
backend = { type = "memory", size = 10000 }

[routers.router1]
routes = [
  # ...
  # other routes are ignored here
  { resolver = 'fail-back' },
]

[groups.fail-back]
resolvers = ["cloudflare-doh", "quad9-dot", "google-dot", "cloudflare-dot"]
type = "fail-back"
servfail-error = true

# ...

In the next few days, I found it more likely to happened when:

  • there're many incoming requests at that time, and
  • upstream resolvers failed immediately (for example conection refused)

To find out what happend to the program, I added the pprof endpoints and finally found out why. As the title says, there may be a deadlock in the fail-back mechanism.

Here's an excerpt of pprof/goroutines:

goroutine profile: total 414
245 @ 0x56ba8 0x6b480 0x6b451 0x8c9e0 0x9c820 0x62f690 0x62f270 0x64056c 0x61e5a0 0x62172c 0x467c48 0x46a914 0x46a1f8 0x8fc08
#	0x8c9df		sync.runtime_SemacquireRWMutexR+0x3b			runtime/sema.go:82
#	0x9c81f		sync.(*RWMutex).RLock+0x57				sync/rwmutex.go:71
#	0x62f68f	github.com/folbricht/routedns.(*FailBack).current+0x4f	github.com/folbricht/routedns/failback.go:104
#	0x62f26f	github.com/folbricht/routedns.(*FailBack).Resolve+0x143	github.com/folbricht/routedns/failback.go:83
#	0x64056b	github.com/folbricht/routedns.(*Router).Resolve+0x32b	github.com/folbricht/routedns/router.go:65
#	0x61e59f	github.com/folbricht/routedns.(*Cache).Resolve+0x59f	github.com/folbricht/routedns/cache.go:197
#	0x62172b	github.com/folbricht/routedns.listenHandler.func1+0x683	github.com/folbricht/routedns/dnslistener.go:80
#	0x467c47	github.com/miekg/dns.HandlerFunc.ServeDNS+0x33		github.com/miekg/[email protected]/server.go:37
#	0x46a913	github.com/miekg/dns.(*Server).serveDNS+0x44f		github.com/miekg/[email protected]/server.go:659
#	0x46a1f7	github.com/miekg/dns.(*Server).serveTCPConn+0x31b	github.com/miekg/[email protected]/server.go:571

149 @ 0x56ba8 0x6b480 0x6b451 0x8c994 0x9b280 0x9af80 0x9c9c0 0x62f7c4 0x62f240 0x64056c 0x61e5a0 0x62172c 0x467c48 0x46a914 0x46a1f8 0x8fc08
#	0x8c993		sync.runtime_SemacquireMutex+0x3b				runtime/sema.go:77
#	0x9b27f		sync.(*Mutex).lockSlow+0x27f					sync/mutex.go:171
#	0x9af7f		sync.(*Mutex).Lock+0x4b						sync/mutex.go:90
#	0x9c9bf		sync.(*RWMutex).Lock+0x1f					sync/rwmutex.go:147
#	0x62f7c3	github.com/folbricht/routedns.(*FailBack).errorFrom+0x3b	github.com/folbricht/routedns/failback.go:114
#	0x62f23f	github.com/folbricht/routedns.(*FailBack).Resolve+0x113		github.com/folbricht/routedns/failback.go:93
#	0x64056b	github.com/folbricht/routedns.(*Router).Resolve+0x32b		github.com/folbricht/routedns/router.go:65
#	0x61e59f	github.com/folbricht/routedns.(*Cache).Resolve+0x59f		github.com/folbricht/routedns/cache.go:197
#	0x62172b	github.com/folbricht/routedns.listenHandler.func1+0x683		github.com/folbricht/routedns/dnslistener.go:80
#	0x467c47	github.com/miekg/dns.HandlerFunc.ServeDNS+0x33			github.com/miekg/[email protected]/server.go:37
#	0x46a913	github.com/miekg/dns.(*Server).serveDNS+0x44f			github.com/miekg/[email protected]/server.go:659
#	0x46a1f7	github.com/miekg/dns.(*Server).serveTCPConn+0x31b		github.com/miekg/[email protected]/server.go:571

1 @ 0x56ba8 0x18010 0x17c7c 0x62fab8 0x62f240 0x64056c 0x61e5a0 0x62172c 0x467c48 0x46a914 0x46a1f8 0x8fc08
#	0x62fab7	github.com/folbricht/routedns.(*FailBack).errorFrom+0x32f	github.com/folbricht/routedns/failback.go:129
#	0x62f23f	github.com/folbricht/routedns.(*FailBack).Resolve+0x113		github.com/folbricht/routedns/failback.go:93
#	0x64056b	github.com/folbricht/routedns.(*Router).Resolve+0x32b		github.com/folbricht/routedns/router.go:65
#	0x61e59f	github.com/folbricht/routedns.(*Cache).Resolve+0x59f		github.com/folbricht/routedns/cache.go:197
#	0x62172b	github.com/folbricht/routedns.listenHandler.func1+0x683		github.com/folbricht/routedns/dnslistener.go:80
#	0x467c47	github.com/miekg/dns.HandlerFunc.ServeDNS+0x33			github.com/miekg/[email protected]/server.go:37
#	0x46a913	github.com/miekg/dns.(*Server).serveDNS+0x44f			github.com/miekg/[email protected]/server.go:659
#	0x46a1f7	github.com/miekg/dns.(*Server).serveTCPConn+0x31b		github.com/miekg/[email protected]/server.go:571

1 @ 0x56ba8 0x6b480 0x6b451 0x8c994 0x9b280 0x9af80 0x9c9c0 0x62fd48 0x8fc08
#	0x8c993		sync.runtime_SemacquireMutex+0x3b					runtime/sema.go:77
#	0x9b27f		sync.(*Mutex).lockSlow+0x27f						sync/mutex.go:171
#	0x9af7f		sync.(*Mutex).Lock+0x4b							sync/mutex.go:90
#	0x9c9bf		sync.(*RWMutex).Lock+0x1f						sync/rwmutex.go:147
#	0x62fd47	github.com/folbricht/routedns.(*FailBack).startResetTimer.func1+0x117	github.com/folbricht/routedns/failback.go:145

As you can see,

  1. a goroutine (lets call it goroutine 1) calling errorFrom has aquired the lock and is trying to write to failCh
  2. the goroutine (goroutine 2) started by startResetTimer() is trying to aquire a lock for writing
  3. other goroutines are trying to aquire a read lock

Goroutine 2 is the only routine that consumes failCh, but it has no chance until goroutine 1 releases the write lock. However, goroutine 1 stucks in failCh with the lock aquired. Finally, the deadlock occurs.

To fix this, we can simply release the lock before writing to failCh:

func (r *FailBack) errorFrom(i int) {
	r.mu.Lock()
-	defer r.mu.Unlock()
	if i != r.active {
+		r.mu.Unlock()
		return
	}
	if r.failCh == nil { // lazy start the reset timer
		r.failCh = r.startResetTimer()
	}
	r.active = (r.active + 1) % len(r.resolvers)
	Log.WithFields(logrus.Fields{
		"id":       r.id,
		"resolver": r.resolvers[r.active].String(),
	}).Debug("failing over to resolver")
+	r.mu.Unlock()
	r.metrics.failover.Add(1)
	r.metrics.available.Add(-1)
	r.failCh <- struct{}{} // signal the timer to wait some more before switching back
}
@folbricht
Copy link
Owner

Thank you for debugging this. The failure mode isn't obvious. This implementation is probably more complex that it needs to be and could do with a rewrite. Perhaps in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants