zetaclient stops connecting to external chain RPCs after receiving HTTP error codes. #3328
Comments
Both of the places were What we really need in cases like this is a goroutine dump via My best guess is that one of the RPCs just hung. The bitcoin rpc client doesn't take contexts or have any configurable timeouts on calls. node/zetaclient/chains/bitcoin/observer/observer.go Lines 308 to 319 in 80ca921
node/zetaclient/chains/bitcoin/signer/signer.go Lines 206 to 213 in 80ca921
It seems we only keep 7 days of logs in Datadog, so the log lines above are no longer tracked. After searching for those log lines in the code, the initial error was on Bitcoin RPC @gartnera As you said, the Bitcoin RPCs are designed NOT to take a context, so they should be able to resume or reconnect by themselves. The root cause is not clear from the logs above.
The whole process was:
Here are my guesses after investigation into logs.
The reason is that neither goroutine ever works silently; both print a message on every iteration. To be specific, On which lines did the above two goroutines get stuck? In our case, the RPC status check got stuck on one of the following RPC calls in function CheckRPCStatus The inbound goroutine got stuck on one of the two lines (not sure which one):
The fact that restarting the program resolved the RPC errors was an interesting hint. The The link to Grafana logs
What's a good way to avoid this in the future? We need to ensure zetaclient can handle unusual responses from RPCs without going offline. If we can't handle the error directly when it happens, could we do something like this: when an RPC heartbeat has failed for 5 minutes, restart the goroutine for that network?
It feels like restarting zetaclient reset or released a process-level resource (I don't know what it was) and thereby solved the RPC hang. The above issue was like (for instance): We will keep this issue under monitoring for a while before taking action, so that we can collect and identify patterns or clues in the errors.
Medium to high effort
I agree, this is the only reliable way of solving this. We need to write a wrapper around We can also leverage this moment to bring proper retries, logs, and Prometheus metrics. UPD: Inspected
@gartnera I agree with the short/medium-term plan to set up a watchdog that restarts automatically. Let's monitor how frequently the issue happens going forward and look for patterns. If RPC calls hang a lot, re-implementing the Bitcoin RPC client (or forking it and adding context support) would be a reasonable next step.
Describe the Bug
When zetaclientd gets an HTTP error back from the BTC RPC it stops working and never retries until zetaclientd is restarted.
To Reproduce
We would need to replicate the 409 or other HTTP error codes from the BTC RPC. It isn't clear whether this is limited to BTC or affects RPC connectivity for all networks.
Expected Behavior
Ideally all attempts to connect to RPCs for external networks should have a backoff timer and continue to retry even if they receive errors from the external RPC server.
Screenshots
If applicable, add screenshots to help explain your problem.
You can see from the logs that one validator was falling behind on BTC blocks.