In recent days, I found that RouteDNS suddenly stopped responding to any requests, seemingly at random, until I restarted the service. My configuration is quite simple: a cache and a router in front of a fail-back group switching between several DoH and DoT resolvers.

Over the next few days, I found the hang was more likely to happen when:
- there are many incoming requests at that time, and
- the upstream resolvers fail immediately (for example, connection refused).
To find out what happened to the program, I added the pprof endpoints, and the goroutine dump finally revealed the cause. As the title says, there may be a deadlock in the fail-back mechanism:
- a goroutine (let's call it goroutine 1) calling errorFrom has acquired the lock and is trying to write to failCh,
- the goroutine (goroutine 2) started by startResetTimer() is trying to acquire the lock for writing, and
- other goroutines are trying to acquire a read lock.
Goroutine 2 is the only routine that consumes failCh, but it has no chance to do so until goroutine 1 releases the write lock. Goroutine 1, however, is stuck on the send to failCh while holding that lock. The result is a deadlock.
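To see why this locks up, here is a minimal, self-contained sketch of the same pattern (the names mu and failCh are illustrative only; this is not RouteDNS code): one goroutine holds the write lock while sending on an unbuffered channel whose only consumer needs that same lock first.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	failCh := make(chan struct{}) // unbuffered, like r.failCh

	// "Goroutine 1" (here: main) takes the write lock first.
	mu.Lock()

	// "Goroutine 2": the only consumer of failCh, but it wants the same
	// write lock before it gets around to receiving (like the reset timer).
	go func() {
		mu.Lock() // blocks: goroutine 1 still holds the write lock
		<-failCh
		mu.Unlock()
	}()

	// Other goroutines (incoming queries) pile up behind the read lock.
	go func() {
		mu.RLock()
		defer mu.RUnlock()
		fmt.Println("handled a query") // never printed
	}()

	time.Sleep(100 * time.Millisecond) // let the goroutines above block

	// Goroutine 1 now blocks forever on the send: the only consumer can
	// never reach its receive while we still hold the write lock.
	failCh <- struct{}{}
	mu.Unlock()
}
```

In a tiny program like this the Go runtime may even abort with "all goroutines are asleep - deadlock!"; in RouteDNS other goroutines (listeners, in-flight queries) keep running, so the process simply stops answering instead of crashing, which matches the behavior I saw.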
To fix this, we can simply release the lock before writing to failCh:
func (r *FailBack) errorFrom(i int) {
    r.mu.Lock()
-   defer r.mu.Unlock()
    if i != r.active {
+       r.mu.Unlock()
        return
    }
    if r.failCh == nil { // lazy start the reset timer
        r.failCh = r.startResetTimer()
    }
    r.active = (r.active + 1) % len(r.resolvers)
    Log.WithFields(logrus.Fields{
        "id":       r.id,
        "resolver": r.resolvers[r.active].String(),
    }).Debug("failing over to resolver")
+   r.mu.Unlock()
    r.metrics.failover.Add(1)
    r.metrics.available.Add(-1)
    r.failCh <- struct{}{} // signal the timer to wait some more before switching back
}
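For readability, this is what errorFrom looks like with the patch applied (just the diff above reassembled, nothing else changed):

```go
func (r *FailBack) errorFrom(i int) {
	r.mu.Lock()
	if i != r.active {
		r.mu.Unlock()
		return
	}
	if r.failCh == nil { // lazy start the reset timer
		r.failCh = r.startResetTimer()
	}
	r.active = (r.active + 1) % len(r.resolvers)
	Log.WithFields(logrus.Fields{
		"id":       r.id,
		"resolver": r.resolvers[r.active].String(),
	}).Debug("failing over to resolver")
	r.mu.Unlock()
	r.metrics.failover.Add(1)
	r.metrics.available.Add(-1)
	r.failCh <- struct{}{} // signal the timer to wait some more before switching back
}
```

Since the defer is gone, the early-return path has to unlock explicitly as well, which is why the patch touches two places. The send on failCh now happens outside the critical section, so the goroutine started by startResetTimer() can take the write lock and still get back to consuming failCh.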
Thank you for debugging this. The failure mode isn't obvious. This implementation is probably more complex than it needs to be and could do with a rewrite. Perhaps in the future.