In recent days, I found that RouteDNS suddenly stopped responding to any requests, seemingly at random, until I restarted the service. My configuration is quite simple: a cache and a router in front of a fail-back group switching between several DoH and DoT resolvers.

Over the next few days, I found the hang was more likely to happen when:
- there are many incoming requests at that time, and
- the upstream resolvers fail immediately (for example, connection refused).
To find out what happened to the program, I added the pprof endpoints, and the goroutine dump finally revealed the cause. As the title says, there may be a deadlock in the fail-back mechanism:
- a goroutine (let's call it goroutine 1) calling errorFrom has acquired the lock and is trying to write to failCh,
- the goroutine (goroutine 2) started by startResetTimer() is trying to acquire the lock for writing, and
- other goroutines are trying to acquire a read lock.
Goroutine 2 is the only routine that consumes failCh, but it has no chance to do so until goroutine 1 releases the write lock. Goroutine 1, however, is stuck on the send to failCh while holding that lock. The result is a deadlock.
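To see why this locks up, here is a minimal, self-contained sketch of the same pattern (the names mu and failCh are illustrative only; this is not RouteDNS code): one goroutine holds the write lock while sending on an unbuffered channel whose only consumer needs that same lock first.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	failCh := make(chan struct{}) // unbuffered, like r.failCh

	// "Goroutine 1" (here: main) takes the write lock first.
	mu.Lock()

	// "Goroutine 2": the only consumer of failCh, but it wants the same
	// write lock before it gets around to receiving (like the reset timer).
	go func() {
		mu.Lock() // blocks: goroutine 1 still holds the write lock
		<-failCh
		mu.Unlock()
	}()

	// Other goroutines (incoming queries) pile up behind the read lock.
	go func() {
		mu.RLock()
		defer mu.RUnlock()
		fmt.Println("handled a query") // never printed
	}()

	time.Sleep(100 * time.Millisecond) // let the goroutines above block

	// Goroutine 1 now blocks forever on the send: the only consumer can
	// never reach its receive while we still hold the write lock.
	failCh <- struct{}{}
	mu.Unlock()
}
```

In a tiny program like this the Go runtime may even abort with "all goroutines are asleep - deadlock!"; in RouteDNS other goroutines (listeners, in-flight queries) keep running, so the process simply stops answering instead of crashing, which matches the behavior I saw.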
To fix this, we can simply release the lock before writing to failCh:
func (r *FailBack) errorFrom(i int) {
    r.mu.Lock()
-   defer r.mu.Unlock()
    if i != r.active {
+       r.mu.Unlock()
        return
    }
    if r.failCh == nil { // lazy start the reset timer
        r.failCh = r.startResetTimer()
    }
    r.active = (r.active + 1) % len(r.resolvers)
    Log.WithFields(logrus.Fields{
        "id":       r.id,
        "resolver": r.resolvers[r.active].String(),
    }).Debug("failing over to resolver")
+   r.mu.Unlock()
    r.metrics.failover.Add(1)
    r.metrics.available.Add(-1)
    r.failCh <- struct{}{} // signal the timer to wait some more before switching back
}
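For readability, this is what errorFrom looks like with the patch applied (just the diff above reassembled, nothing else changed):

```go
func (r *FailBack) errorFrom(i int) {
	r.mu.Lock()
	if i != r.active {
		r.mu.Unlock()
		return
	}
	if r.failCh == nil { // lazy start the reset timer
		r.failCh = r.startResetTimer()
	}
	r.active = (r.active + 1) % len(r.resolvers)
	Log.WithFields(logrus.Fields{
		"id":       r.id,
		"resolver": r.resolvers[r.active].String(),
	}).Debug("failing over to resolver")
	r.mu.Unlock()
	r.metrics.failover.Add(1)
	r.metrics.available.Add(-1)
	r.failCh <- struct{}{} // signal the timer to wait some more before switching back
}
```

Since the defer is gone, the early-return path has to unlock explicitly as well, which is why the patch touches two places. The send on failCh now happens outside the critical section, so the goroutine started by startResetTimer() can take the write lock and still get back to consuming failCh.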
Thank you for debugging this. The failure mode isn't obvious. This implementation is probably more complex than it needs to be and could do with a rewrite. Perhaps in the future.