
zetaclient stops connecting to external chain RPCs after receiving HTTP error codes. #3328

Open
CharlieMc0 opened this issue Dec 30, 2024 · 8 comments
Labels
bug (Something isn't working) · zetaclient (Issues related to ZetaClient)

Comments

@CharlieMc0
Member

CharlieMc0 commented Dec 30, 2024

Describe the Bug
When zetaclientd gets an HTTP error back from the BTC RPC it stops working and never retries until zetaclientd is restarted.

To Reproduce
Would need to replicate the 409 or other HTTP error codes from the BTC RPC. It isn't clear whether this is limited to BTC or an issue with RPC connectivity for all networks.

Expected Behavior
Ideally all attempts to connect to RPCs for external networks should have a backoff timer and continue to retry even if they receive errors from the external RPC server.

Screenshots
If applicable, add screenshots to help explain your problem.

Screenshot 2024-12-29 at 6 07 06 PM

You can see from the logs that one validator was falling behind on BTC blocks:
Screenshot 2024-12-29 at 6 07 32 PM
Screenshot 2024-12-29 at 6 11 04 PM

@CharlieMc0 added the bug and zetaclient labels on Dec 30, 2024
@gartnera
Member

gartnera commented Jan 2, 2025

Both of the places where FetchUTXOs is called do not exit on error. So if this is really happening, then the log messages are probably unrelated.

What we really need in cases like this is a goroutine dump via curl -o goroutine.dump http://localhost:6061/debug/pprof/goroutine?debug=2
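(Side note: if the pprof endpoint isn't already exposed on that port, a minimal sketch of wiring it up in Go looks roughly like this. This is illustrative only, not zetaclientd's actual startup code; the port 6061 is simply taken from the curl command above.)

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Serve pprof on a local-only port so
    // curl http://localhost:6061/debug/pprof/goroutine?debug=2 works.
    go func() {
        if err := http.ListenAndServe("localhost:6061", nil); err != nil {
            log.Printf("pprof server stopped: %v", err)
        }
    }()
    select {} // stands in for the real process main loop
}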

My best guess is that one of the RPCs just hung. The Bitcoin RPC client doesn't take contexts or have any configurable timeouts on its calls.

err := ob.FetchUTXOs(ctx)
if err != nil {
    // log at debug level if the error is that no wallet is loaded
    // this is to prevent extensive logging in localnet when the wallet is not loaded for non-Bitcoin tests
    // TODO: prevent this routine from running if Bitcoin node is not enabled
    // https://github.com/zeta-chain/node/issues/2790
    if !strings.Contains(err.Error(), "No wallet is loaded") {
        ob.logger.UTXOs.Error().Err(err).Msg("error fetching btc utxos")
    } else {
        ob.logger.UTXOs.Debug().Err(err).Msg("No wallet is loaded")
    }
}

// refresh unspent UTXOs and continue with keysign regardless of error
err := observer.FetchUTXOs(ctx)
if err != nil {
    signer.Logger().
        Std.Error().
        Err(err).
        Msgf("SignGasWithdraw: FetchUTXOs error: nonce %d chain %d", nonce, chain.ChainId)
}

@ws4charlie
Contributor

It seems we only kept 7 days of logs in Datadog, so the above log prints are no longer available. After searching for the above log prints in the code: the initial error was on the Bitcoin RPC GetBlockCount (error getting block height), and then FetchUTXOs failed. When FetchUTXOs fails, zetaclient won't be able to get the balance and outbound will not work properly.

@gartnera As you said, the Bitcoin RPCs are designed to NOT take a context, so they should be able to resume/reconnect by themselves. The root cause is not clear from the above logs.

@ws4charlie
Contributor

ws4charlie commented Jan 6, 2025

The whole process was:

  • We have a watchRPCStatus goroutine that NEVER stops for the Bitcoin chain.
  • Every minute, this goroutine prints either RPC latency is OK or an error message.
  • The last heartbeat of validator1-us-east-1-mainnet was at 16:20:43, as shown in the screenshot below.
  • The heartbeat resumed after restarting zetaclient at 18:58:33.

Here are my guesses after investigating the logs.

  1. The watchRPCStatus go-routine got stuck on this call: blockTime, err := rpc.CheckRPCStatus(ob.btcClient, tssAddress), for 2.5 hours.
  2. The ObserveInbound go-routine also got stuck, for a similar reason, on this line: err := ob.ObserveInbound(ctx)

The reasoning is that neither go-routine ever works silently; they print a message on every iteration. To be specific:
watchRPCStatus prints RPC latency is OK or an error;
ObserveInbound prints observeInboundBTC: block XXX has YYY txs.
We see no logs like the above after the last heartbeat.

On which lines did the above two go-routines get stuck?
A go-routine stops executing only if it is blocked by a lock or by IO. There is no lock used in the above go-routines (and no crash either), so IO is the only thing we can suspect.

In our case, the RPC status check got stuck on one of the following RPC calls inside the function CheckRPCStatus:
GetBlockCount, GetBlockHash, GetBlockHeader, ListUnspentMinMaxAddresses.

The inbound go-routine got stuck on one of these two lines (not sure which one):

  • currentBlock, err := ob.btcClient.GetBlockCount() at line 76, which invokes GetBlockCount.
  • res, err := ob.GetBlockByNumberCached(int64(blockNumber)) at line 108, which invokes GetBlockHash, GetBlockHeader, GetBlockVerboseTx internally.
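To illustrate the failure mode: calls like GetBlockCount are plain blocking HTTP round trips with no context or timeout, so if the connection stalls, the whole go-routine stalls with it. A rough sketch of a guard that would at least surface such a hang (hypothetical helper, not existing zetaclient code; note the inner go-routine still leaks until the underlying call returns):

import (
    "fmt"
    "time"

    "github.com/btcsuite/btcd/rpcclient"
)

// getBlockCountWithTimeout runs the blocking RPC call in its own go-routine
// and gives up after `timeout`, so callers like watchRPCStatus or
// ObserveInbound can log an error instead of hanging silently.
func getBlockCountWithTimeout(client *rpcclient.Client, timeout time.Duration) (int64, error) {
    type result struct {
        height int64
        err    error
    }
    ch := make(chan result, 1)
    go func() {
        h, err := client.GetBlockCount() // may block indefinitely if the connection stalls
        ch <- result{h, err}
    }()
    select {
    case r := <-ch:
        return r.height, r.err
    case <-time.After(timeout):
        // the inner go-routine is still blocked and leaks until the call returns
        return 0, fmt.Errorf("GetBlockCount did not return within %s", timeout)
    }
}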

The fact that restarting the program resolved the RPC errors is an interesting hint. A 409 error indicates a request conflict with the current state of the target resource, so why would a restart resolve that?

The link to Grafana logs:
[Grafana logs screenshot]

@CharlieMc0
Member Author

CharlieMc0 commented Jan 6, 2025

What's a good way to avoid this in the future? We need to ensure zetaclient can handle unusual responses from RPCs without going offline.

Something like: if the RPC heartbeat fails for 5 minutes, restart the go-routine for that network? That would cover the case where we can't handle the error directly when it happens.

@ws4charlie
Contributor

ws4charlie commented Jan 6, 2025

It feels like restarting zetaclient reset/released some process-level resource (I don't know what it was), and that is what solved the hung RPC problem. The issue looked something like this (for instance):
A zetaclient go-routine calls GetBlockCount and the call just gets stuck on that line and never returns, so program execution hangs unexpectedly (the Bitcoin RPCs were designed to return without a context). A remedy for this would be external and operational instead of self-diagnosis inside zetaclient.

We will leave this issue under monitoring for a while before taking action, so we can collect and identify some patterns/clues about the errors.

@gartnera
Member

gartnera commented Jan 6, 2025

What's a good way to avoid this in the future? We need to ensure zetaclient can handle unusual responses from RPCs without going offline.

  1. Completely reimplement the Bitcoin RPC client such that it takes contexts and properly returns on timeout/context cancellation.
  2. Add an in-process watchdog that must be bumped every N seconds. If it is not bumped, the process automatically restarts.

Medium to high effort
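A rough sketch of what option 2 could look like (illustrative only, not an existing zetaclient component; names and intervals are made up): each worker bumps the watchdog on every successful iteration, and a background ticker exits the process if no bump arrives within the allowed window, leaving the restart to systemd/Kubernetes.

import (
    "log"
    "sync"
    "time"
)

// Watchdog is a hypothetical in-process watchdog. Workers call Bump() on each
// successful iteration; if no bump arrives within maxAge, the process exits so
// the supervisor restarts it.
type Watchdog struct {
    mu   sync.Mutex
    last time.Time
}

func NewWatchdog(maxAge time.Duration) *Watchdog {
    w := &Watchdog{last: time.Now()}
    go func() {
        for range time.Tick(maxAge / 4) {
            w.mu.Lock()
            stale := time.Since(w.last) > maxAge
            w.mu.Unlock()
            if stale {
                log.Fatalf("watchdog: no bump within %s, exiting so the supervisor restarts the process", maxAge)
            }
        }
    }()
    return w
}

func (w *Watchdog) Bump() {
    w.mu.Lock()
    w.last = time.Now()
    w.mu.Unlock()
}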

@swift1337
Contributor

swift1337 commented Jan 7, 2025

Completely reimplement the Bitcoin RPC such that it takes contexts and properly returns on timeout/context cancellation.

I agree, this is the only reliable way of solving this.

We need to write a wrapper around github.com/btcsuite/btcd/btcjson that uses the same DTOs but has proper HTTP & ctx support. Only the methods we actually use need to be implemented; there is no need to code all supported RPC methods.

We can also use this opportunity to add proper retries, logs, and Prometheus metrics.


UPD: I inspected btcjson; it shouldn't be hard to implement our own client.

btcjson.MarshalCmd(...) marshals types into {"jsonrpc":"1.0","method":"...","params":["..."],"id":1} RPC calls.
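For reference, a bare-bones sketch of what a context-aware call could look like if we marshal the JSON-RPC 1.0 request ourselves (the real wrapper would reuse btcjson's DTOs and MarshalCmd as described above; every name below is hypothetical):

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
)

type rpcRequest struct {
    Jsonrpc string        `json:"jsonrpc"`
    Method  string        `json:"method"`
    Params  []interface{} `json:"params"`
    ID      int64         `json:"id"`
}

type rpcResponse struct {
    Result json.RawMessage `json:"result"`
    Error  *struct {
        Code    int    `json:"code"`
        Message string `json:"message"`
    } `json:"error"`
}

// call performs a single bitcoind JSON-RPC request; ctx cancellation and the
// http.Client timeout both abort the request instead of hanging forever.
func call(ctx context.Context, hc *http.Client, url, user, pass, method string, params ...interface{}) (json.RawMessage, error) {
    if params == nil {
        params = []interface{}{}
    }
    body, err := json.Marshal(rpcRequest{Jsonrpc: "1.0", Method: method, Params: params, ID: 1})
    if err != nil {
        return nil, err
    }
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
    if err != nil {
        return nil, err
    }
    req.SetBasicAuth(user, pass)
    req.Header.Set("Content-Type", "application/json")
    resp, err := hc.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    var r rpcResponse
    if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
        return nil, err
    }
    if r.Error != nil {
        return nil, fmt.Errorf("rpc error %d: %s", r.Error.Code, r.Error.Message)
    }
    return r.Result, nil
}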

@ws4charlie
Contributor

@gartnera I agree with the short/medium-term plan to set up a watchdog that restarts automatically. Let's monitor how frequently the issue happens moving forward and look for patterns. If necessary, re-implementing (or forking and adding context to) the Bitcoin RPC client is not a bad idea if RPC calls hang a lot.
