
Conversation

@m-Peter
Collaborator

@m-Peter m-Peter commented Sep 26, 2025

Closes: #764

Description

I tried this locally by setting up 2 Flow Emulator processes on different ports:
1st process:

flow emulator -v

2nd process:

flow emulator -v --port=3599 --rest-port=9999 --admin-port=9090 --debugger-port=3456

Then I terminated the 1st process, which was configured as the main AN (AccessNodeHost).
The EVM Gateway continued operating normally, directing requests to the AccessNodeBackupHosts, i.e. the 2nd process.

This was mainly achieved with these 2 gRPC options:

grpcOpts.WithResolvers(mr),
grpcOpts.WithDefaultServiceConfig(json)

We make use of the built-in pick_first client-side load balancer, and we still keep our custom retryInterceptor, which first retries failed requests against the same AN before letting the load balancer pick the first available one.
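
For readers skimming the diff, here is a minimal, self-contained sketch of that wiring using plain gRPC dial options. It is not the gateway's actual bootstrap code: the dialWithBackups helper and the insecure credentials are illustrative, while the manual resolver, the dns:///flow-access target, and the pick_first service config mirror what this PR sets up.

package bootstrapsketch

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/resolver"
	"google.golang.org/grpc/resolver/manual"
)

// dialWithBackups dials one logical gRPC client over the primary AN plus
// the backup ANs, letting the pick_first balancer fail over in list order.
func dialWithBackups(primaryHost string, backupHosts []string) (*grpc.ClientConn, error) {
	mr := manual.NewBuilderWithScheme("dns")

	// Order matters: the primary endpoint comes first, so pick_first only
	// moves on to a backup once the primary becomes unreachable.
	endpoints := make([]resolver.Endpoint, 0, len(backupHosts)+1)
	for _, host := range append([]string{primaryHost}, backupHosts...) {
		endpoints = append(endpoints, resolver.Endpoint{
			Addresses: []resolver.Address{{Addr: host}},
		})
	}
	mr.InitialState(resolver.State{Endpoints: endpoints})

	serviceConfig := `{"loadBalancingConfig": [{"pick_first":{}}]}`

	return grpc.NewClient(
		"dns:///flow-access", // target must use the manual resolver's scheme
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithResolvers(mr),
		grpc.WithDefaultServiceConfig(serviceConfig),
	)
}

With pick_first, every RPC goes to the first reachable endpoint in the list, so the backups only receive traffic once the primary becomes unreachable.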


For contributor use:

  • Targeted PR against master branch
  • Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
  • Code follows the standards mentioned here.
  • Updated relevant documentation
  • Re-reviewed Files changed in the Github PR explorer
  • Added appropriate labels

Summary by CodeRabbit

  • New Features

    • New CLI flag and config field to specify multiple backup Access Node hosts; client can use resolver-based endpoints for failover.
  • Bug Fixes

    • Refined retry/backoff behavior and retryable error handling; renamed timing constants; expanded transient-error handling and replaced single resubscribe attempt with a retry-until-success-or-timeout loop.
  • Tests

    • Added integration test for Access Node backup failover; test harness now accepts an explicit server configuration.

@m-Peter m-Peter self-assigned this Sep 26, 2025
@coderabbitai
Contributor

coderabbitai bot commented Sep 26, 2025

Walkthrough

Adds support for dialing Access Nodes via a manual gRPC resolver (primary + backups with pick_first LB), updates retry/interceptor constants and behavior, makes subscriber reconnects retry with a 30s window, adds CLI/config for backup hosts, and refactors tests to accept an explicit server config plus a new integration test for AN failover.

Changes

Cohort / File(s) Summary
Bootstrap: gRPC resolver & retry renames
bootstrap/bootstrap.go
Adds resolver-based dial path when AccessNodeBackupHosts present (manual resolver + pick_first service config, endpoints = primary + backups); retains legacy single-host dial otherwise. Replaces ResourceExhausted-specific retry constants with DefaultRetryDelay and DefaultMaxRetryDelay. Updates imports for resolver packages.
Config & CLI
config/config.go, cmd/run/cmd.go
Adds AccessNodeBackupHosts []string to Config and --access-node-backup-hosts CLI flag; parses comma-separated backup hosts into cfg.AccessNodeBackupHosts.
Subscriber reconnect behavior
services/ingestion/event_subscriber.go
Treats codes.Unavailable like transient errors (DeadlineExceeded/Internal); replaces a single resubscribe attempt with a retry loop that repeatedly calls connect(lastReceivedHeight) with 200ms waits up to a 30s timeout before returning BlockEventsError (see the sketch after this table).
Test helpers refactor
tests/helpers.go
Adds defaultServerConfig() *server.Config; changes startEmulator(createTestAccounts bool) → startEmulator(createTestAccounts bool, conf *server.Config) and uses the provided config for emulator startup.
Integration test: AN failover
tests/integration_test.go
Adds Test_AccessNodeBackupFunctionality to start primary + backup emulators, boot gateway with AccessNodeBackupHosts, and verify client SyncProgress served by primary then by backup after primary stops. Updates other tests to pass defaultServerConfig().
Other tests updates
tests/key_store_release_test.go, tests/tx_batching_test.go
Update calls to startEmulator to include defaultServerConfig(); no behavioral changes beyond using explicit config.
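
To make the event_subscriber change above concrete, here is a self-contained sketch of the retry-until-success-or-timeout pattern. The 200ms pause and the 30s window come from the PR; the reconnectWithTimeout helper, its signature, and the package name are illustrative stand-ins for the subscriber's internals.

package ingestionsketch

import (
	"fmt"
	"time"
)

// reconnectWithTimeout retries connect(lastReceivedHeight) every 200ms for up
// to 30s, returning the last error once the retry window is exhausted.
func reconnectWithTimeout(connect func(height uint64) error, lastReceivedHeight uint64) error {
	pauseDuration, maxDuration := 200*time.Millisecond, 30*time.Second
	start := time.Now()

	for {
		err := connect(lastReceivedHeight)
		if err == nil {
			return nil // resubscribed successfully
		}

		if time.Since(start) >= maxDuration {
			// the subscriber wraps this into a BlockEventsError and gives up
			return fmt.Errorf(
				"failed to resubscribe for events on height: %d, with: %w",
				lastReceivedHeight, err,
			)
		}

		time.Sleep(pauseDuration)
	}
}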

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant App as Gateway Bootstrap
  participant CFG as Config
  participant R as Manual Resolver
  participant GRPC as gRPC Client
  participant AN as Access Nodes

  Note over CFG: AccessNodeHost + optional AccessNodeBackupHosts

  App->>CFG: Read AccessNodeHost, BackupHosts
  alt BackupHosts provided
    App->>R: Build endpoints list (primary + backups)
    R-->>GRPC: Provide resolver InitialState (endpoints)
    App->>GRPC: Dial with resolver + service config (pick_first) + retry interceptor
    GRPC->>AN: Connect via LB across endpoints
  else No backups
    App->>GRPC: Dial single host (legacy path)
    GRPC->>AN: Connect to primary only
  end
sequenceDiagram
  autonumber
  participant Sub as RPCEventSubscriber
  participant AN as Access Node
  participant Timer as Timer/Backoff

  Sub->>AN: Subscribe / Stream
  AN--xSub: Error (DeadlineExceeded / Internal / Unavailable / NotFound)
  alt NotFound
    Sub->>Timer: Short wait
    Timer-->>Sub: Retry subscribe
  else Transient error
    loop Retry loop (up to 30s)
      Sub->>Timer: Wait 200ms
      Timer-->>Sub: Retry connect(lastReceivedHeight)
      Sub->>AN: Attempt reconnect/resubscribe
      AN-->>Sub: Success or continue loop
    end
    alt Timeout reached
      Sub->>Sub: Return BlockEventsError(lastHeight, lastErr)
    end
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • zhangchiqing
  • janezpodhostnik
  • peterargue

Poem

"I map the hosts in tidy rows,
I pick the one where traffic goes.
When primary naps, backups hop in line,
I hop, reconnect — the streams align.
Thump-thump, the gateway hums — all fine! 🐇"

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Out of Scope Changes Check: ⚠️ Warning. The modifications in services/ingestion/event_subscriber.go introduce generic retry and reconnection logic for event subscription that is unrelated to the backup Access Node failover functionality described in Issue #764, representing an out-of-scope change. Resolution: Consider removing or splitting the event_subscriber retry enhancements into a separate pull request focused on event subscription resilience.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check: ✅ Passed. The pull request title clearly and concisely describes the primary change of adding backup Access Nodes to enhance networking resilience, accurately reflecting the core objective without extraneous details.
Linked Issues Check: ✅ Passed. The changes introduce a configurable list of backup Access Nodes via a CLI flag and Config field, implement gRPC client logic with a resolver and retry interceptor to gracefully fail over to backups, and include integration tests that validate automatic switchover, fully satisfying the objectives of Issue #764.
Docstring Coverage: ✅ Passed. No functions found in the changes. Docstring coverage check skipped.


@m-Peter m-Peter force-pushed the mpeter/access-nodes-network-resilience branch from 2baf5d5 to b9c76eb September 26, 2025 13:28
@m-Peter m-Peter changed the title from "Allow adding back-up AN hosts to be used for client-side request load balancing" to "Allow setting back-up ANs to be used for networking resilience" Sep 29, 2025
@m-Peter m-Peter force-pushed the mpeter/access-nodes-network-resilience branch from 87552da to 2883e0e September 29, 2025 06:35
@m-Peter m-Peter changed the title from "Allow setting back-up ANs to be used for networking resilience" to "Allow setting back-up ANs for networking resilience" Sep 29, 2025
@m-Peter m-Peter marked this pull request as ready for review September 29, 2025 11:02
// next block is finalized. just wait briefly and try again
time.Sleep(200 * time.Millisecond)
case codes.DeadlineExceeded, codes.Internal:
case codes.DeadlineExceeded, codes.Internal, codes.Unavailable:
Collaborator Author

During some local testing with 2 Flow Emulator processes, when I killed the 1st process, which was configured as the main AN (AccessNodeHost), the EVM GW crashed with:

failure in event subscription with: recoverable: disconnected:
error receiving event: rpc error:
code = Unavailable desc = error reading from server: EOF

Adding the codes.Unavailable case solved this issue.

Contributor

I think Unavailable is different, since it is unlikely that the node will suddenly become available on reconnect. If we receive Unavailable, I think it should fail over to the new node. We could let it retry for some period first, but at the expense of delaying data.

Collaborator Author

@m-Peter m-Peter Oct 6, 2025

I am not really sure how SubscribeEventsByBlockHeight is implemented under the hood, but I've observed that it doesn't go through the retryInterceptor, unlike other AN calls. For example:

[METHOD]:  /flow.access.AccessAPI/GetLatestBlock
[METHOD]:  /flow.access.AccessAPI/GetAccountAtLatestBlock
[METHOD]:  /flow.access.AccessAPI/SendTransaction
[METHOD]:  /flow.access.AccessAPI/ExecuteScriptAtLatestBlock

I have verified, though, that we do need codes.Unavailable in the switch case above, so that it triggers a reconnect with:

if err := connect(lastReceivedHeight); err != nil {
	eventsChan <- models.NewBlockEventsError(
		fmt.Errorf(
			"failed to resubscribe for events on height: %d, with: %w",
			lastReceivedHeight,
			err,
		),
	)
	return
}

Note that if we trigger a reconnect, this prompts the pick_first load balancer to search for the next node that can serve the given request, even if it is unlikely that the current node will suddenly become available again.
This saves the EVM Gateway from a fatal error, provided the configured backup ANs are indeed available.
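
Putting the snippets from this thread together, the transient-error branch roughly looks like the following. This is a condensed sketch assembled from the fragments above; the comments, the NotFound wording, and the default branch are simplified rather than copied from the gateway.

switch status.Code(err) {
case codes.NotFound:
	// the next block is not available yet; wait briefly and retry the same height
	time.Sleep(200 * time.Millisecond)
case codes.DeadlineExceeded, codes.Internal, codes.Unavailable:
	// the stream was disconnected; reconnecting re-dials through the client,
	// which is what lets pick_first move on to a backup AN if the primary is down
	if err := connect(lastReceivedHeight); err != nil {
		eventsChan <- models.NewBlockEventsError(
			fmt.Errorf(
				"failed to resubscribe for events on height: %d, with: %w",
				lastReceivedHeight,
				err,
			),
		)
		return
	}
default:
	// any other error is surfaced to the caller (simplified)
	eventsChan <- models.NewBlockEventsError(err)
	return
}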

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
cmd/run/cmd.go (1)

204-206: Trim backup host inputs to avoid whitespace-induced dial failures.

Users often pass comma-separated lists with spaces ("host-1.com, host-2.com"). Without trimming, we end up dialing " host-2.com", which gRPC will reject. Filtering out empty/whitespace-only entries at parse time makes the new flag much harder to misconfigure.

 	if accessNodeBackupHosts != "" {
-		cfg.AccessNodeBackupHosts = strings.Split(accessNodeBackupHosts, ",")
+		var hosts []string
+		for _, host := range strings.Split(accessNodeBackupHosts, ",") {
+			host = strings.TrimSpace(host)
+			if host == "" {
+				continue
+			}
+			hosts = append(hosts, host)
+		}
+		cfg.AccessNodeBackupHosts = hosts
 	}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 039545d and 07fa89f.

📒 Files selected for processing (8)
  • bootstrap/bootstrap.go (3 hunks)
  • cmd/run/cmd.go (3 hunks)
  • config/config.go (1 hunks)
  • services/ingestion/event_subscriber.go (1 hunks)
  • tests/helpers.go (3 hunks)
  • tests/integration_test.go (6 hunks)
  • tests/key_store_release_test.go (1 hunks)
  • tests/tx_batching_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/integration_test.go (3)
bootstrap/create-multi-key-account.go (1)
  • CreateMultiKeyAccount (70-194)
config/config.go (2)
  • Config (43-127)
  • TxStateValidation (36-36)
bootstrap/bootstrap.go (2)
  • New (83-111)
  • Run (740-756)

DefaultResourceExhaustedMaxRetryDelay = 30 * time.Second
// DefaultMaxRetryDelay is the default max request duration when retrying failed
// gRPC requests to one of the Access Nodes.
DefaultMaxRetryDelay = 30 * time.Second
Collaborator Author

Now that we are adding load-balancing functionality, should we maybe decrease the max retry duration?

Contributor

Is this load balancing or failover? I'm not aware of a demand for load balancing requests to different backends, but there is a clear need for failing over when the primary is unavailable.

Collaborator Author

@m-Peter m-Peter Oct 6, 2025

Given that we use pick_first as the load-balancing strategy, this effectively works as a failover mechanism: it sticks to the same backend until that backend is unable to serve any requests (due to connectivity issues), in which case it picks the next available backend.

I only changed the name of the constant from DefaultResourceExhaustedMaxRetryDelay to DefaultMaxRetryDelay, because previously the retryInterceptor would only retry on ResourceExhausted errors.

But I've updated that condition in f30b0b3 to account for more related errors that can be retried on the same AN.

return nil
}

if status.Code(err) != codes.ResourceExhausted {
Collaborator Author

Since we added the load balancing config:

`{"loadBalancingConfig": [{"pick_first":{}}]}`

I removed this error code check entirely.

The reason being, if we receive any kind of gRPC error from one of the ANs:

  1. The request will be retried for the max specified duration on the same AN
  2. Then the configured pick_first load-balancing strategy will try the next ANs, until it finds one that responds without an error

However, I just noticed that the retryInterceptor is used even when there aren't any configured back-up ANs.
Should we just change the DefaultMaxRetryDelay instead?

Contributor

Is there a way to customize the behavior?

For instance, if the backend returned ResourceExhausted, retrying immediately will only make it worse and the node will eventually have to fail over, whereas pausing briefly may allow the next request to succeed.

Similarly, if the error is OutOfRange or NotFound, retrying immediately is not likely to succeed.

Canceled or DeadlineExceeded are guaranteed to fail all requests if the source was a local context.

Collaborator Author

@m-Peter m-Peter Oct 6, 2025

The reason being, if we receive any kind of gRPC error from one of the ANs:

  1. The request will be retried for the max specified duration on the same AN
  2. Then the configured pick_first load-balancing strategy will try the next ANs, until it finds one that responds without an error

Sorry about that, but I had a misconception about how these 2 relate to each other.

The pick_first load-balancing strategy only checks the node's connectivity state (Ready / Connecting / Idle / TransientFailure, etc.), and that's the only signal it uses to decide when to connect to the next node.
The errors that might be returned from specific node API calls do not affect node selection in any way, so we can't direct the load balancer to connect to a different node from inside the retryInterceptor.

So the retryInterceptor and the pick_first load-balancing strategy don't step on each other's toes.

I have added some error handling in retryInterceptor in f30b0b3, to include the errors you mentioned above.
Is that what you had in mind?

@m-Peter m-Peter force-pushed the mpeter/access-nodes-network-resilience branch from 779637a to 7306e79 October 1, 2025 07:25
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
tests/tx_batching_test.go (5)

42-56: Fix: range over int does not compile

for i := range totalTxs is invalid; range cannot iterate an int. Use an index loop or iterate the slice you allocated.

- for i := range totalTxs {
+ for i := 0; i < totalTxs; i++ {
   signed, _, err := evmSign(big.NewInt(10), 21000, eoaKey, nonce, &testAddr, nil)
   …
   nonce += 1
 }

126-151: Fix: invalid loop over int in goroutine block

Same issue; replace for range totalTxs with an index loop.

- for range totalTxs {
+ for i := 0; i < totalTxs; i++ {
   signed, _, err := evmSign(
     big.NewInt(transferAmount),
     23_500,
     privateKey,
     nonce,
     &testEoaReceiver,
     nil,
   )
   …
   nonce += 1
 }

227-242: Fix: invalid range over literal uint64(2)

Use a counted loop.

- for i := range uint64(2) {
+ for i := uint64(0); i < 2; i++ {
   signed, _, err := evmSign(
     big.NewInt(500_000),
     23_500,
     privatekey,
     i,
     &testEoaReceiver,
     nil,
   )

336-356: Fix: invalid range over literal uint64(2)

Same pattern; replace with counted loop.

- for i := range uint64(2) {
+ for i := uint64(0); i < 2; i++ {

450-471: Fix: invalid range over literal uint64(2)

Same correction as above.

- for i := range uint64(2) {
+ for i := uint64(0); i < 2; i++ {
bootstrap/bootstrap.go (1)

566-598: Scope retries to transient errors and add exponential backoff with jitter

Retrying on all errors for up to 30s can mask permanent failures (e.g., InvalidArgument, NotFound), inflate latency, and increase load. Limit to transient statuses and add backoff+jitter. Also consider reducing max duration now that failover is handled by LB.

-func retryInterceptor(maxDuration, pauseDuration time.Duration) grpcOpts.UnaryClientInterceptor {
+func retryInterceptor(maxDuration, basePause time.Duration) grpcOpts.UnaryClientInterceptor {
   return func(ctx context.Context, method string, req, reply any, cc *grpcOpts.ClientConn, invoker grpcOpts.UnaryInvoker, opts ...grpcOpts.CallOption) error {
     start := time.Now()
     attempts := 0
     for {
       err := invoker(ctx, method, req, reply, cc, opts...)
       if err == nil {
         return nil
       }
+      // Only retry on transient codes.
+      st, _ := status.FromError(err)
+      switch st.Code() {
+      case codes.Unavailable, codes.DeadlineExceeded, codes.ResourceExhausted, codes.Aborted, codes.Internal:
+        // retry
+      default:
+        return err
+      }
       attempts++
       duration := time.Since(start)
       if duration >= maxDuration {
         return fmt.Errorf("request failed (attempts: %d, duration: %v): %w", attempts, duration, err)
       }
-      select {
+      // Exponential backoff with jitter (cap at 2s).
+      pause := time.Duration(math.Min(float64(2*time.Second), float64(basePause)*math.Pow(2, float64(attempts-1))))
+      jitter := time.Duration(rand.Int63n(int64(pause / 2)))
+      wait := pause/2 + jitter
+      select {
       case <-ctx.Done():
         return ctx.Err()
-      case <-time.After(pauseDuration):
+      case <-time.After(wait):
       }
     }
   }
 }

Also consider DefaultMaxRetryDelay → 5–10s now that pick_first will fail over to backups. Do you want me to open a small follow-up PR with these changes and benchmarks?

♻️ Duplicate comments (1)
bootstrap/bootstrap.go (1)

488-521: Manual resolver + pick_first wiring looks correct now

Target uses the resolver’s scheme with scheme:///authority, resolver state initialized with ordered endpoints, and WithResolvers(mr) + service config applied. This fixes the earlier issue where the manual resolver wasn’t engaged.

🧹 Nitpick comments (6)
services/ingestion/event_subscriber.go (1)

222-245: Add small backoff/jitter before reconnecting on transient errors

To avoid tight reconnect loops (especially when multiple subscribers run), add a brief randomized delay for the transient cases (DeadlineExceeded/Internal/Unavailable).

-   case codes.DeadlineExceeded, codes.Internal, codes.Unavailable:
-     // these are sometimes returned when the stream is disconnected by a middleware or the server
+   case codes.DeadlineExceeded, codes.Internal, codes.Unavailable:
+     // transient disconnects; pause briefly with jitter to reduce thrash
+     time.Sleep(150*time.Millisecond + time.Duration(rand.Intn(200))*time.Millisecond)
tests/helpers.go (2)

95-119: Prefer readiness check over fixed sleep for emulator start

time.Sleep(1s) is flaky on slow CI. Poll emulator readiness (e.g., SDK Ping or wait for gRPC port) before proceeding.

 go func() {
   srv.Start()
 }()
- time.Sleep(1000 * time.Millisecond)
+ // Wait until the emulator responds to Ping (up to 5s)
+ client, _ := grpc.NewClient("localhost:3569")
+ require.Eventually(nil, func() bool { return client.Ping(context.Background()) == nil }, 5*time.Second, 100*time.Millisecond)

418-427: Close HTTP response body to avoid leaks

Add defer res.Body.Close() after Do(req) to free resources promptly.

 res, err := http.DefaultClient.Do(req)
 if err != nil {
   return nil, err
 }
+defer res.Body.Close()
bootstrap/bootstrap.go (2)

44-51: Naming nit: these control total retry duration and pause, not “delay”

Consider DefaultMaxRetryDuration and DefaultRetryInterval to reflect usage.


507-519: Optional: tune connect/backoff params for faster failover

You can bound dial backoff to speed up switching when the first endpoint flaps.

 grpcOpts.WithGRPCDialOptions(
   grpcOpts.WithDefaultCallOptions(grpcOpts.MaxCallRecvMsgSize(DefaultMaxMessageSize)),
   grpcOpts.WithResolvers(mr),
   grpcOpts.WithDefaultServiceConfig(json),
+  grpcOpts.WithConnectParams(grpcOpts.ConnectParams{
+    MinConnectTimeout: 2 * time.Second,
+    Backoff: backoff.Config{BaseDelay: 200 * time.Millisecond, Multiplier: 1.6, Jitter: 0.2, MaxDelay: 2 * time.Second},
+  }),
   grpcOpts.WithUnaryInterceptor(retryInterceptor(
     DefaultMaxRetryDelay,
     DefaultRetryDelay,
   )),
 )
tests/integration_test.go (1)

617-621: Unused context variable.

The backupCancel function is created but never used; the returned context from WithCancel is discarded. While the cleanup via backupSrv.Stop() is sufficient, this creates an unused variable.

Apply this diff to simplify:

-	_, backupCancel := context.WithCancel(context.Background())
-	defer func() {
-		backupCancel()
-		backupSrv.Stop()
-	}()
+	defer backupSrv.Stop()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 779637a and 7306e79.

📒 Files selected for processing (8)
  • bootstrap/bootstrap.go (3 hunks)
  • cmd/run/cmd.go (3 hunks)
  • config/config.go (1 hunks)
  • services/ingestion/event_subscriber.go (1 hunks)
  • tests/helpers.go (3 hunks)
  • tests/integration_test.go (6 hunks)
  • tests/key_store_release_test.go (1 hunks)
  • tests/tx_batching_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cmd/run/cmd.go
  • config/config.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-03-07T01:35:09.751Z
Learnt from: peterargue
PR: onflow/flow-evm-gateway#772
File: services/requester/keystore/key_store.go:50-62
Timestamp: 2025-03-07T01:35:09.751Z
Learning: In the flow-evm-gateway codebase, panics are acceptable in scenarios where immediate detection of critical bugs is desired during development and testing, particularly for invariant violations that should never occur in a correctly functioning system (e.g., when a key is available but locked in the keystore implementation).

Applied to files:

  • services/ingestion/event_subscriber.go
🧬 Code graph analysis (1)
tests/integration_test.go (3)
bootstrap/create-multi-key-account.go (1)
  • CreateMultiKeyAccount (70-194)
config/config.go (2)
  • Config (43-127)
  • TxStateValidation (36-36)
bootstrap/bootstrap.go (2)
  • New (83-111)
  • Run (741-757)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test
🔇 Additional comments (9)
tests/tx_batching_test.go (1)

518-518: LGTM: explicit emulator server config

Switching to startEmulator(true, defaultServerConfig()) aligns with the new helper signature and improves clarity.

services/ingestion/event_subscriber.go (1)

227-233: Good: treat Unavailable as transient

Including codes.Unavailable in the reconnect path prevents crashes during AN restarts/disconnects. Matches the observed failure mode.

tests/helpers.go (1)

70-94: Panic on config init is acceptable in tests

Using panic for invariant failures during test server config/setup is fine here. Based on learnings.

tests/key_store_release_test.go (1)

23-23: LGTM: adopt new emulator helper signature

Using startEmulator(true, defaultServerConfig()) keeps tests consistent with the new helper.

bootstrap/bootstrap.go (1)

24-26: grpc-go version compatibility confirmed
go.mod declares google.golang.org/grpc v1.75.0, which is ≥ v1.58 and includes resolver.State.Endpoints.

tests/integration_test.go (4)

33-33: LGTM: Centralized server configuration.

Consistent adoption of defaultServerConfig() across existing tests improves maintainability by centralizing emulator configuration logic.

Also applies to: 144-144, 249-249, 328-328, 452-452


596-607: LGTM: Proper test setup and teardown.

Standard test initialization with context cancellation and server cleanup properly deferred.


623-642: LGTM: Standard account creation pattern.

Multi-key account creation follows the established pattern from other integration tests.


644-677: LGTM: Proper gateway bootstrap with backup ANs.

Configuration correctly includes AccessNodeBackupHosts and follows the standard bootstrap pattern with proper lifecycle management.


mr := manual.NewBuilderWithScheme("dns")
defer mr.Close()

json := `{"loadBalancingConfig": [{"pick_first":{}}]}`
Contributor

Can you document what the expected behavior is? I'm assuming that pick_first means it will always use the first one unless it fails, then it will go to the next. It would be good to make that clear here.

Collaborator Author

Good point. I've added comments in 117deb1, with a link to the official example: https://github.com/grpc/grpc-go/tree/master/examples/features/load_balancing#pick_first
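
For readers who don't follow the link, the documented behavior amounts to roughly the following (a paraphrase of the grpc-go pick_first example, not the actual comment text from 117deb1):

// pick_first connects to the resolver's endpoints in the order they are
// listed (primary AN first, then the backup ANs) and sends every RPC to the
// first reachable endpoint. Only when that endpoint becomes unreachable does
// it move on to the next one, so this behaves as failover rather than
// spreading requests across backends.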

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 117deb1 and 667681e.

📒 Files selected for processing (1)
  • tests/integration_test.go (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/integration_test.go (3)
bootstrap/create-multi-key-account.go (1)
  • CreateMultiKeyAccount (70-194)
config/config.go (2)
  • Config (43-127)
  • TxStateValidation (36-36)
bootstrap/bootstrap.go (2)
  • New (83-111)
  • Run (746-762)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test
🔇 Additional comments (5)
tests/integration_test.go (5)

33-33: LGTM! Consistent test helper refactoring.

The signature updates to startEmulator are consistent across all test functions and align with the refactoring to accept explicit server configurations.

Also applies to: 144-144, 249-249, 328-328, 452-452


644-662: Gateway configuration correctly sets up primary and backup ANs.

The configuration properly specifies both AccessNodeHost (primary at port 3569) and AccessNodeBackupHosts (backup at port 3599), enabling the failover behavior to be tested.


686-695: Primary AN validation and shutdown logic is sound.

The test correctly:

  1. Validates that eth_syncing works via the primary AN (lines 686-690)
  2. Explicitly stops the primary emulator to simulate AN unavailability (lines 694-695)

This sets up the conditions for testing backup failover.


697-711: Excellent use of retry loop to handle failover timing.

The assert.Eventually with a 5-second timeout and 500ms retry interval properly addresses the previous review concern about gRPC resolver/load-balancer needing time to detect the primary's unavailability and failover to the backup. This approach eliminates the potential flakiness that would result from an immediate single call after shutdown.


595-711: Well-structured test for backup AN functionality.

The test effectively validates the backup AccessNode failover mechanism:

  • Starts independent primary and backup emulators
  • Configures the gateway with both ANs
  • Validates normal operation via primary
  • Simulates primary failure
  • Confirms backup takes over with appropriate retry logic

Note: The test is appropriately scoped to read-only operations (eth_syncing). The multi-key account is created only on the primary emulator, which is sufficient for this test. If future tests need to validate transaction submission after failover, they would need to ensure account state exists on backup ANs as well.
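
The failover assertion described here boils down to a pattern like this (a sketch inside the test function, assuming testify's assert package and a go-ethereum ethclient; the client and backupCtx names follow the surrounding discussion, and the success condition is illustrative):

// keep polling until the gateway serves the request via the backup AN,
// or the 5-second window elapses
assert.Eventually(t, func() bool {
	_, err := client.SyncProgress(backupCtx)
	return err == nil
}, 5*time.Second, 500*time.Millisecond)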

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 199be73 and f04957f.

📒 Files selected for processing (1)
  • services/ingestion/event_subscriber.go (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-03-07T01:35:09.751Z
Learnt from: peterargue
PR: onflow/flow-evm-gateway#772
File: services/requester/keystore/key_store.go:50-62
Timestamp: 2025-03-07T01:35:09.751Z
Learning: In the flow-evm-gateway codebase, panics are acceptable in scenarios where immediate detection of critical bugs is desired during development and testing, particularly for invariant violations that should never occur in a correctly functioning system (e.g., when a key is available but locked in the keystore implementation).

Applied to files:

  • services/ingestion/event_subscriber.go
🧬 Code graph analysis (1)
services/ingestion/event_subscriber.go (2)
models/events.go (1)
  • NewBlockEventsError (252-256)
models/errors/errors.go (1)
  • ErrDisconnected (24-24)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test
🔇 Additional comments (1)
services/ingestion/event_subscriber.go (1)

227-227: LGTM! Addition of codes.Unavailable enables proper failover to backup ANs.

This change correctly handles the Unavailable error code, allowing the reconnection logic to trigger the pick_first load balancer to fail over to backup Access Nodes when the primary becomes unavailable.

Based on past review comments and PR objectives.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f04957f and f30b0b3.

📒 Files selected for processing (1)
  • bootstrap/bootstrap.go (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test
🔇 Additional comments (4)
bootstrap/bootstrap.go (4)

25-26: LGTM! Imports align with manual resolver implementation.

The gRPC resolver imports are correctly added to support the new manual resolver-based dialing for backup Access Nodes.


46-52: LGTM! Constant renaming reflects broader retry scope.

The constants are appropriately renamed from DefaultResourceExhaustedRetryDelay/DefaultResourceExhaustedMaxRetryDelay to DefaultRetryDelay/DefaultMaxRetryDelay, aligning with the expanded retry logic in retryInterceptor that now handles multiple error types beyond just ResourceExhausted.


490-538: Backup Access Node implementation is well-structured (pending resolver lifecycle fix).

The conditional logic correctly distinguishes between configurations with and without backup hosts:

  • Backup path: Uses manual resolver with pick_first load balancing to enable failover.
  • Single-host path: Retains existing direct dial behavior.

The endpoint construction (lines 500-508) correctly places the primary host first, followed by backup hosts, ensuring proper failover order. The target URI format "dns:///flow-access" is correct and engages the manual resolver as intended.

However, the critical issue with defer mr.Close() on line 492 must be addressed before this implementation can function correctly.


592-603: Verify that Unavailable errors should trigger immediate failover without retry.

The updated retry logic treats different error codes distinctly:

  • Canceled/DeadlineExceeded: Non-retriable (correct—guaranteed to fail on local context).
  • ResourceExhausted/OutOfRange/NotFound: Retried with a brief pause (reasonable—gives the AN a chance to recover).
  • Other errors (including Unavailable): Return immediately without retrying on the same backend.

For Unavailable errors (common when an AN is down or unreachable), the current implementation skips retrying on the same backend and allows the pick_first load balancer to immediately switch to a backup AN. This seems intentional for fast failover.

However, based on past review discussions about whether to retry all errors or only specific ones, please confirm this is the intended behavior. Specifically:

  1. Should Unavailable errors be retried with a pause on the same backend first, or immediately handed off to the load balancer for failover?
  2. Are there any other error codes (e.g., Internal, Unknown) that should be explicitly handled?

If the current behavior is correct, consider adding a comment explaining that Unavailable errors trigger immediate failover for resilience:

 		switch status.Code(err) {
 		case codes.Canceled, codes.DeadlineExceeded:
 			// these kind of errors are guaranteed to fail all requests,
 			// if the source was a local context
 			return err
 		case codes.ResourceExhausted, codes.OutOfRange, codes.NotFound:
 			// when we receive these errors, we pause briefly, so that
 			// the next request on the same AN, has a higher chance
 			// of success.
 		default:
+			// For connection errors like Unavailable, return immediately
+			// to allow the load balancer to fail over to a backup AN.
 			return err
 		}

@m-Peter m-Peter force-pushed the mpeter/access-nodes-network-resilience branch from f30b0b3 to 422d66e October 9, 2025 13:32
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
bootstrap/bootstrap.go (1)

482-518: Keep the manual resolver alive for the client lifetime

Line 484: defer mr.Close() runs as soon as this function returns, tearing down the resolver the client depends on for failover. Remove the defer and close the resolver when the client is shut down instead (e.g., store mr and call Close() from StopClient).

-		mr := manual.NewBuilderWithScheme("dns")
-		defer mr.Close()
+		mr := manual.NewBuilderWithScheme("dns")
+		// Keep this resolver alive until the client is closed.
🧹 Nitpick comments (1)
cmd/run/cmd.go (1)

204-206: Trim whitespace when parsing backup hosts

Line 205: CLI values like "host1, host2" will leave a leading space on the second entry, producing an invalid dial target. Trim the segments before assigning to cfg.AccessNodeBackupHosts.

-	if accessNodeBackupHosts != "" {
-		cfg.AccessNodeBackupHosts = strings.Split(accessNodeBackupHosts, ",")
-	}
+	if accessNodeBackupHosts != "" {
+		hosts := strings.Split(accessNodeBackupHosts, ",")
+		for i := range hosts {
+			hosts[i] = strings.TrimSpace(hosts[i])
+		}
+		cfg.AccessNodeBackupHosts = hosts
+	}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f30b0b3 and 422d66e.

📒 Files selected for processing (8)
  • bootstrap/bootstrap.go (4 hunks)
  • cmd/run/cmd.go (3 hunks)
  • config/config.go (1 hunks)
  • services/ingestion/event_subscriber.go (1 hunks)
  • tests/helpers.go (3 hunks)
  • tests/integration_test.go (6 hunks)
  • tests/key_store_release_test.go (1 hunks)
  • tests/tx_batching_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/key_store_release_test.go
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-06T10:14:49.676Z
Learnt from: m-Peter
PR: onflow/flow-evm-gateway#890
File: services/ingestion/event_subscriber.go:235-239
Timestamp: 2025-10-06T10:14:49.676Z
Learning: In services/ingestion/event_subscriber.go, when reconnecting after disconnect errors (DeadlineExceeded, Internal, Unavailable), the subscription should reconnect at lastReceivedHeight rather than lastReceivedHeight+1. This avoids errors when the next height doesn't exist yet, and duplicate event processing is safe because the ingestion engine is explicitly designed to be idempotent (storage uses batch.Set() which overwrites existing entries).

Applied to files:

  • services/ingestion/event_subscriber.go
📚 Learning: 2025-03-07T01:35:09.751Z
Learnt from: peterargue
PR: onflow/flow-evm-gateway#772
File: services/requester/keystore/key_store.go:50-62
Timestamp: 2025-03-07T01:35:09.751Z
Learning: In the flow-evm-gateway codebase, panics are acceptable in scenarios where immediate detection of critical bugs is desired during development and testing, particularly for invariant violations that should never occur in a correctly functioning system (e.g., when a key is available but locked in the keystore implementation).

Applied to files:

  • services/ingestion/event_subscriber.go
🧬 Code graph analysis (2)
services/ingestion/event_subscriber.go (2)
models/events.go (1)
  • NewBlockEventsError (252-256)
models/errors/errors.go (1)
  • ErrDisconnected (24-24)
tests/integration_test.go (3)
bootstrap/create-multi-key-account.go (1)
  • CreateMultiKeyAccount (70-194)
config/config.go (3)
  • Config (43-127)
  • TxStateValidation (36-36)
  • LocalIndexValidation (39-39)
bootstrap/bootstrap.go (2)
  • New (85-113)
  • Run (753-769)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test
🔇 Additional comments (7)
tests/helpers.go (2)

70-93: LGTM! Clean refactoring for test configuration reusability.

Extracting the server configuration into defaultServerConfig() improves maintainability and allows tests to provide custom configurations when needed. Using panic for error handling is acceptable in test helpers, though it differs from typical Go patterns.


95-119: LGTM! Signature change enables configurable test emulators.

The expanded signature allows tests to provide custom server configurations, supporting scenarios like the new backup AccessNode test. The implementation correctly uses the provided configuration.

tests/integration_test.go (5)

33-33: LGTM! Consistent updates to match new startEmulator signature.

All test functions correctly updated to pass defaultServerConfig() to startEmulator, maintaining consistency across the test suite.

Also applies to: 144-144, 249-249, 328-328, 452-452


596-622: LGTM! Well-structured backup AccessNode test setup.

The test correctly:

  • Sets up primary and backup emulators with non-conflicting ports
  • Maintains separate contexts for independent lifecycle management
  • Addresses past review feedback on context handling

The backup emulator configuration with distinct GRPC/REST/Admin/Debugger ports ensures proper isolation during failover testing.


644-662: LGTM! Gateway configuration correctly includes backup hosts.

The configuration properly wires in the AccessNodeBackupHosts field, demonstrating the new failover functionality introduced in this PR. The backup host correctly references the alternate emulator port.


664-690: LGTM! Proper gateway initialization and primary validation.

The test correctly:

  • Initializes the gateway with backup host configuration
  • Ensures proper cleanup with deferred boot.Stop()
  • Validates the primary AccessNode is serving requests before testing failover

692-711: LGTM! Robust failover validation with proper retry logic.

The test correctly:

  • Shuts down the primary AccessNode to trigger failover
  • Uses assert.Eventually with retry logic to account for gRPC resolver/load balancer detection time (addresses past review feedback)
  • Uses backupCtx for validation since the primary context was cancelled
  • Employs reasonable timeout (5s) and polling interval (500ms) for failover detection

This validates that the backup AccessNode successfully takes over when the primary becomes unavailable, fulfilling the PR's core objective.

@m-Peter m-Peter force-pushed the mpeter/access-nodes-network-resilience branch from 422d66e to ce4808f October 9, 2025 13:47
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
services/ingestion/event_subscriber.go (1)

235-258: LGTM! Retry loop correctly implements 30-second reconnection window.

The retry logic properly handles reconnection attempts with appropriate timeout and pause intervals. Reconnecting at lastReceivedHeight (not +1) is correct because the ingestion engine is explicitly idempotent.

Based on learnings.

Optional: Remove unused attempts variable.

The attempts variable (line 236) is incremented but never used. If you plan to log or expose retry metrics, retain it; otherwise, consider removing it to reduce clutter.

Apply this diff if the variable is not needed:

-			attempts := 0
 			pauseDuration, maxDuration := 200*time.Millisecond, 30*time.Second
 			// Allow reconnect retries for up to 30 seconds, with retry
 			// attempts every 200 ms.
 			for {
 				err := connect(lastReceivedHeight)
 				if err == nil {
 					break
 				}

-				attempts++
 				duration := time.Since(start)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 422d66e and ce4808f.

📒 Files selected for processing (1)
  • services/ingestion/event_subscriber.go (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-06T10:14:49.676Z
Learnt from: m-Peter
PR: onflow/flow-evm-gateway#890
File: services/ingestion/event_subscriber.go:235-239
Timestamp: 2025-10-06T10:14:49.676Z
Learning: In services/ingestion/event_subscriber.go, when reconnecting after disconnect errors (DeadlineExceeded, Internal, Unavailable), the subscription should reconnect at lastReceivedHeight rather than lastReceivedHeight+1. This avoids errors when the next height doesn't exist yet, and duplicate event processing is safe because the ingestion engine is explicitly designed to be idempotent (storage uses batch.Set() which overwrites existing entries).

Applied to files:

  • services/ingestion/event_subscriber.go
📚 Learning: 2025-03-07T01:35:09.751Z
Learnt from: peterargue
PR: onflow/flow-evm-gateway#772
File: services/requester/keystore/key_store.go:50-62
Timestamp: 2025-03-07T01:35:09.751Z
Learning: In the flow-evm-gateway codebase, panics are acceptable in scenarios where immediate detection of critical bugs is desired during development and testing, particularly for invariant violations that should never occur in a correctly functioning system (e.g., when a key is available but locked in the keystore implementation).

Applied to files:

  • services/ingestion/event_subscriber.go
🧬 Code graph analysis (1)
services/ingestion/event_subscriber.go (2)
models/events.go (1)
  • NewBlockEventsError (252-256)
models/errors/errors.go (1)
  • ErrDisconnected (24-24)
🔇 Additional comments (1)
services/ingestion/event_subscriber.go (1)

227-227: LGTM! Enables failover to backup Access Nodes.

Adding codes.Unavailable to the reconnection trigger is correct. When the primary AN becomes unavailable, this ensures the retry loop engages and the pick_first load balancer can select the next available backup host, preventing fatal errors.

Based on learnings and past review discussions.

Successfully merging this pull request may close these issues.

Improve resilience on connections with upstream ANs

3 participants