
Conversation

@m-Peter
Collaborator

@m-Peter m-Peter commented Sep 26, 2025

Closes: #764

Description

I tried this locally by setting up 2 Flow Emulator processes on different ports:
1st process:

flow emulator -v

2nd process:

flow emulator -v --port=3599 --rest-port=9999 --admin-port=9090 --debugger-port=3456

Then I terminated the 1st process, which was configured as the main AN (AccessNodeHost).
The EVM Gateway continued operating normally, directing requests to the AccessNodeBackupHosts, i.e. the 2nd process.

This was mainly achieved with these 2 gRPC options:

grpcOpts.WithResolvers(mr),
grpcOpts.WithDefaultServiceConfig(json)

We make use of the built-in pick_first client-side load balancer, and we still keep our custom retryInterceptor, which first retries failed requests against the same AN before letting the load balancer pick the first available one.
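
For readers skimming the diff, here is a minimal, self-contained sketch of that wiring using plain gRPC dial options. It is not the gateway's actual bootstrap code: the dialWithBackups helper and the insecure credentials are illustrative, while the manual resolver, the dns:///flow-access target, and the pick_first service config mirror what this PR sets up.

package bootstrapsketch

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/resolver"
	"google.golang.org/grpc/resolver/manual"
)

// dialWithBackups dials one logical gRPC client over the primary AN plus
// the backup ANs, letting the pick_first balancer fail over in list order.
func dialWithBackups(primaryHost string, backupHosts []string) (*grpc.ClientConn, error) {
	mr := manual.NewBuilderWithScheme("dns")

	// Order matters: the primary endpoint comes first, so pick_first only
	// moves on to a backup once the primary becomes unreachable.
	endpoints := make([]resolver.Endpoint, 0, len(backupHosts)+1)
	for _, host := range append([]string{primaryHost}, backupHosts...) {
		endpoints = append(endpoints, resolver.Endpoint{
			Addresses: []resolver.Address{{Addr: host}},
		})
	}
	mr.InitialState(resolver.State{Endpoints: endpoints})

	serviceConfig := `{"loadBalancingConfig": [{"pick_first":{}}]}`

	return grpc.NewClient(
		"dns:///flow-access", // target must use the manual resolver's scheme
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithResolvers(mr),
		grpc.WithDefaultServiceConfig(serviceConfig),
	)
}

With pick_first, every RPC goes to the first reachable endpoint in the list, so the backups only receive traffic once the primary becomes unreachable.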


For contributor use:

  • Targeted PR against master branch
  • Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
  • Code follows the standards mentioned here.
  • Updated relevant documentation
  • Re-reviewed Files changed in the Github PR explorer
  • Added appropriate labels

Summary by CodeRabbit

  • New Features

    • New CLI flag and config field to specify multiple backup Access Node hosts; client can use resolver-based endpoints for failover.
  • Bug Fixes

    • Refined retry/backoff behavior and retryable error handling; renamed timing constants; expanded transient-error handling and replaced single resubscribe attempt with a retry-until-success-or-timeout loop.
  • Tests

    • Added integration test for Access Node backup failover; test harness now accepts an explicit server configuration.

@m-Peter m-Peter self-assigned this Sep 26, 2025
@coderabbitai
Contributor

coderabbitai bot commented Sep 26, 2025

Walkthrough

Adds support for dialing Access Nodes via a manual gRPC resolver (primary + backups with pick_first LB), updates retry/interceptor constants and behavior, makes subscriber reconnects retry with a 30s window, adds CLI/config for backup hosts, and refactors tests to accept an explicit server config plus a new integration test for AN failover.

Changes

Cohort / File(s) Summary
Bootstrap: gRPC resolver & retry renames
bootstrap/bootstrap.go
Adds resolver-based dial path when AccessNodeBackupHosts present (manual resolver + pick_first service config, endpoints = primary + backups); retains legacy single-host dial otherwise. Replaces ResourceExhausted-specific retry constants with DefaultRetryDelay and DefaultMaxRetryDelay. Updates imports for resolver packages.
Config & CLI
config/config.go, cmd/run/cmd.go
Adds AccessNodeBackupHosts []string to Config and --access-node-backup-hosts CLI flag; parses comma-separated backup hosts into cfg.AccessNodeBackupHosts.
Subscriber reconnect behavior
services/ingestion/event_subscriber.go
Treats codes.Unavailable like transient errors (DeadlineExceeded/Internal); replaces a single resubscribe attempt with a retry loop that repeatedly calls connect(lastReceivedHeight) with 200ms waits up to a 30s timeout before returning BlockEventsError (see the sketch after this table).
Test helpers refactor
tests/helpers.go
Adds defaultServerConfig() *server.Config; changes startEmulator(createTestAccounts bool) → startEmulator(createTestAccounts bool, conf *server.Config) and uses the provided config for emulator startup.
Integration test: AN failover
tests/integration_test.go
Adds Test_AccessNodeBackupFunctionality to start primary + backup emulators, boot gateway with AccessNodeBackupHosts, and verify client SyncProgress served by primary then by backup after primary stops. Updates other tests to pass defaultServerConfig().
Other tests updates
tests/key_store_release_test.go, tests/tx_batching_test.go
Update calls to startEmulator to include defaultServerConfig(); no behavioral changes beyond using explicit config.
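
To make the event_subscriber change above concrete, here is a self-contained sketch of the retry-until-success-or-timeout pattern. The 200ms pause and the 30s window come from the PR; the reconnectWithTimeout helper, its signature, and the package name are illustrative stand-ins for the subscriber's internals.

package ingestionsketch

import (
	"fmt"
	"time"
)

// reconnectWithTimeout retries connect(lastReceivedHeight) every 200ms for up
// to 30s, returning the last error once the retry window is exhausted.
func reconnectWithTimeout(connect func(height uint64) error, lastReceivedHeight uint64) error {
	pauseDuration, maxDuration := 200*time.Millisecond, 30*time.Second
	start := time.Now()

	for {
		err := connect(lastReceivedHeight)
		if err == nil {
			return nil // resubscribed successfully
		}

		if time.Since(start) >= maxDuration {
			// the subscriber wraps this into a BlockEventsError and gives up
			return fmt.Errorf(
				"failed to resubscribe for events on height: %d, with: %w",
				lastReceivedHeight, err,
			)
		}

		time.Sleep(pauseDuration)
	}
}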

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant App as Gateway Bootstrap
  participant CFG as Config
  participant R as Manual Resolver
  participant GRPC as gRPC Client
  participant AN as Access Nodes

  Note over CFG: AccessNodeHost + optional AccessNodeBackupHosts

  App->>CFG: Read AccessNodeHost, BackupHosts
  alt BackupHosts provided
    App->>R: Build endpoints list (primary + backups)
    R-->>GRPC: Provide resolver InitialState (endpoints)
    App->>GRPC: Dial with resolver + service config (pick_first) + retry interceptor
    GRPC->>AN: Connect via LB across endpoints
  else No backups
    App->>GRPC: Dial single host (legacy path)
    GRPC->>AN: Connect to primary only
  end
sequenceDiagram
  autonumber
  participant Sub as RPCEventSubscriber
  participant AN as Access Node
  participant Timer as Timer/Backoff

  Sub->>AN: Subscribe / Stream
  AN--xSub: Error (DeadlineExceeded / Internal / Unavailable / NotFound)
  alt NotFound
    Sub->>Timer: Short wait
    Timer-->>Sub: Retry subscribe
  else Transient error
    loop Retry loop (up to 30s)
      Sub->>Timer: Wait 200ms
      Timer-->>Sub: Retry connect(lastReceivedHeight)
      Sub->>AN: Attempt reconnect/resubscribe
      AN-->>Sub: Success or continue loop
    end
    alt Timeout reached
      Sub->>Sub: Return BlockEventsError(lastHeight, lastErr)
    end
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • zhangchiqing
  • janezpodhostnik
  • peterargue

Poem

"I map the hosts in tidy rows,
I pick the one where traffic goes.
When primary naps, backups hop in line,
I hop, reconnect — the streams align.
Thump-thump, the gateway hums — all fine! 🐇"

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Out of Scope Changes Check: ⚠️ Warning. The modifications in services/ingestion/event_subscriber.go introduce generic retry and reconnection logic for event subscription that is unrelated to the backup Access Node failover functionality described in Issue #764, representing an out-of-scope change. Resolution: Consider removing or splitting the event_subscriber retry enhancements into a separate pull request focused on event subscription resilience.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check: ✅ Passed. The pull request title clearly and concisely describes the primary change of adding backup Access Nodes to enhance networking resilience, accurately reflecting the core objective without extraneous details.
Linked Issues Check: ✅ Passed. The changes introduce a configurable list of backup Access Nodes via a CLI flag and Config field, implement gRPC client logic with a resolver and retry interceptor to gracefully fail over to backups, and include integration tests that validate automatic switchover, fully satisfying the objectives of Issue #764.
Docstring Coverage: ✅ Passed. No functions found in the changes. Docstring coverage check skipped.


@m-Peter m-Peter force-pushed the mpeter/access-nodes-network-resilience branch from 2baf5d5 to b9c76eb September 26, 2025 13:28
@m-Peter m-Peter changed the title from "Allow adding back-up AN hosts to be used for client-side request load balancing" to "Allow setting back-up ANs to be used for networking resilience" Sep 29, 2025
@m-Peter m-Peter force-pushed the mpeter/access-nodes-network-resilience branch from 87552da to 2883e0e September 29, 2025 06:35
@m-Peter m-Peter changed the title from "Allow setting back-up ANs to be used for networking resilience" to "Allow setting back-up ANs for networking resilience" Sep 29, 2025
@m-Peter m-Peter marked this pull request as ready for review September 29, 2025 11:02
// next block is finalized. just wait briefly and try again
time.Sleep(200 * time.Millisecond)
case codes.DeadlineExceeded, codes.Internal:
case codes.DeadlineExceeded, codes.Internal, codes.Unavailable:
Collaborator Author

During some local testing with 2 Flow Emulator processes, when I killed the 1st process, which was configured as the main AN (AccessNodeHost), the EVM GW crashed with:

failure in event subscription with: recoverable: disconnected:
error receiving event: rpc error:
code = Unavailable desc = error reading from server: EOF

Adding the codes.Unavailable case solved this issue.

Contributor

I think Unavailable is different, since it is unlikely that the node will suddenly become available on reconnect. If we receive Unavailable, I think it should fail over to the new node. We could let it retry for some period first, but at the expense of delaying data.

Collaborator Author

@m-Peter m-Peter Oct 6, 2025

I am not really sure how SubscribeEventsByBlockHeight is implemented under the hood, but I've observed that it doesn't go through the retryInterceptor, unlike other AN calls. For example:

[METHOD]:  /flow.access.AccessAPI/GetLatestBlock
[METHOD]:  /flow.access.AccessAPI/GetAccountAtLatestBlock
[METHOD]:  /flow.access.AccessAPI/SendTransaction
[METHOD]:  /flow.access.AccessAPI/ExecuteScriptAtLatestBlock

I have verified, though, that we do need codes.Unavailable in the switch case above, so that it triggers a reconnect with:

if err := connect(lastReceivedHeight); err != nil {
	eventsChan <- models.NewBlockEventsError(
		fmt.Errorf(
			"failed to resubscribe for events on height: %d, with: %w",
			lastReceivedHeight,
			err,
		),
	)
	return
}

Note that if we trigger a reconnect, this prompts the pick_first load balancer to search for the next node that can serve the given request, even if it is unlikely that the current node will suddenly become available again.
This saves the EVM Gateway from a fatal error, provided the configured backup ANs are indeed available.
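
Putting the snippets from this thread together, the transient-error branch roughly looks like the following. This is a condensed sketch assembled from the fragments above; the comments, the NotFound wording, and the default branch are simplified rather than copied from the gateway.

switch status.Code(err) {
case codes.NotFound:
	// the next block is not available yet; wait briefly and retry the same height
	time.Sleep(200 * time.Millisecond)
case codes.DeadlineExceeded, codes.Internal, codes.Unavailable:
	// the stream was disconnected; reconnecting re-dials through the client,
	// which is what lets pick_first move on to a backup AN if the primary is down
	if err := connect(lastReceivedHeight); err != nil {
		eventsChan <- models.NewBlockEventsError(
			fmt.Errorf(
				"failed to resubscribe for events on height: %d, with: %w",
				lastReceivedHeight,
				err,
			),
		)
		return
	}
default:
	// any other error is surfaced to the caller (simplified)
	eventsChan <- models.NewBlockEventsError(err)
	return
}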

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
cmd/run/cmd.go (1)

204-206: Trim backup host inputs to avoid whitespace-induced dial failures.

Users often pass comma-separated lists with spaces ("host-1.com, host-2.com"). Without trimming, we end up dialing " host-2.com", which gRPC will reject. Filtering out empty/whitespace-only entries at parse time makes the new flag much harder to misconfigure.

 	if accessNodeBackupHosts != "" {
-		cfg.AccessNodeBackupHosts = strings.Split(accessNodeBackupHosts, ",")
+		var hosts []string
+		for _, host := range strings.Split(accessNodeBackupHosts, ",") {
+			host = strings.TrimSpace(host)
+			if host == "" {
+				continue
+			}
+			hosts = append(hosts, host)
+		}
+		cfg.AccessNodeBackupHosts = hosts
 	}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 039545d and 07fa89f.

📒 Files selected for processing (8)
  • bootstrap/bootstrap.go (3 hunks)
  • cmd/run/cmd.go (3 hunks)
  • config/config.go (1 hunks)
  • services/ingestion/event_subscriber.go (1 hunks)
  • tests/helpers.go (3 hunks)
  • tests/integration_test.go (6 hunks)
  • tests/key_store_release_test.go (1 hunks)
  • tests/tx_batching_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/integration_test.go (3)
bootstrap/create-multi-key-account.go (1)
  • CreateMultiKeyAccount (70-194)
config/config.go (2)
  • Config (43-127)
  • TxStateValidation (36-36)
bootstrap/bootstrap.go (2)
  • New (83-111)
  • Run (740-756)

DefaultResourceExhaustedMaxRetryDelay = 30 * time.Second
// DefaultMaxRetryDelay is the default max request duration when retrying failed
// gRPC requests to one of the Access Nodes.
DefaultMaxRetryDelay = 30 * time.Second
Collaborator Author

Now that we are adding load-balancing functionality, should we maybe decrease the max retry duration?

Contributor

Is this load balancing or failover? I'm not aware of a demand for load balancing requests to different backends, but there is a clear need for failing over when the primary is unavailable.

Collaborator Author

@m-Peter m-Peter Oct 6, 2025

Given that we use pick_first as the load-balancing strategy, this effectively works as a failover mechanism: it sticks to the same backend until that backend is unable to serve any requests (due to connectivity issues), in which case it picks the next available backend.

I only changed the name of the constant from DefaultResourceExhaustedMaxRetryDelay to DefaultMaxRetryDelay, because previously the retryInterceptor would only retry on ResourceExhausted errors.

But I've updated that condition in f30b0b3 to account for more related errors that can be retried on the same AN.

return nil
}

if status.Code(err) != codes.ResourceExhausted {
Collaborator Author

Since we added the load balancing config:

`{"loadBalancingConfig": [{"pick_first":{}}]}`

I removed this error code check entirely.

The reason being, if we receive any kind of gRPC error from one of the ANs:

  1. The request will be retried for the max specified duration on the same AN
  2. Then the configured pick_first load-balancing strategy will try the next ANs, until it finds one that responds without an error

However, I just noticed that the retryInterceptor is used even when there aren't any configured back-up ANs.
Should we just change the DefaultMaxRetryDelay instead?

Contributor

Is there a way to customize the behavior?

For instance, if the backend returned ResourceExhausted, retrying immediately will only make it worse and the node will eventually have to fail over, whereas pausing briefly may allow the next request to succeed.

Similarly, if the error is OutOfRange or NotFound, retrying immediately is not likely to succeed.

Canceled or DeadlineExceeded are guaranteed to fail all requests if the source was a local context.

Collaborator Author

@m-Peter m-Peter Oct 6, 2025

The reason being, if we receive any kind of gRPC error from one of the ANs:

  1. The request will be retried for the max specified duration on the same AN
  2. Then the configured pick_first load-balancing strategy will try the next ANs, until it finds one that responds without an error

Sorry about that, but I had a misconception about how these 2 relate to each other.

The pick_first load-balancing strategy only checks the node's connectivity state (Ready / Connecting / Idle / TransientFailure, etc.), and that's the only signal it uses to decide when to connect to the next node.
The errors that might be returned from specific node API calls do not affect node selection in any way, so we can't direct the load balancer to connect to a different node from inside the retryInterceptor.

So the retryInterceptor and the pick_first load-balancing strategy don't step on each other's toes.

I have added some error handling in retryInterceptor in f30b0b3, to include the errors you mentioned above.
Is that what you had in mind?

@m-Peter m-Peter force-pushed the mpeter/access-nodes-network-resilience branch from 779637a to 7306e79 October 1, 2025 07:25
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
tests/tx_batching_test.go (5)

42-56: Fix: range over int does not compile

for i := range totalTxs is invalid; range cannot iterate an int. Use an index loop or iterate the slice you allocated.

- for i := range totalTxs {
+ for i := 0; i < totalTxs; i++ {
   signed, _, err := evmSign(big.NewInt(10), 21000, eoaKey, nonce, &testAddr, nil)
   …
   nonce += 1
 }

126-151: Fix: invalid loop over int in goroutine block

Same issue; replace for range totalTxs with an index loop.

- for range totalTxs {
+ for i := 0; i < totalTxs; i++ {
   signed, _, err := evmSign(
     big.NewInt(transferAmount),
     23_500,
     privateKey,
     nonce,
     &testEoaReceiver,
     nil,
   )
   …
   nonce += 1
 }

227-242: Fix: invalid range over literal uint64(2)

Use a counted loop.

- for i := range uint64(2) {
+ for i := uint64(0); i < 2; i++ {
   signed, _, err := evmSign(
     big.NewInt(500_000),
     23_500,
     privatekey,
     i,
     &testEoaReceiver,
     nil,
   )

336-356: Fix: invalid range over literal uint64(2)

Same pattern; replace with counted loop.

- for i := range uint64(2) {
+ for i := uint64(0); i < 2; i++ {

450-471: Fix: invalid range over literal uint64(2)

Same correction as above.

- for i := range uint64(2) {
+ for i := uint64(0); i < 2; i++ {
bootstrap/bootstrap.go (1)

566-598: Scope retries to transient errors and add exponential backoff with jitter

Retrying on all errors for up to 30s can mask permanent failures (e.g., InvalidArgument, NotFound), inflate latency, and increase load. Limit to transient statuses and add backoff+jitter. Also consider reducing max duration now that failover is handled by LB.

-func retryInterceptor(maxDuration, pauseDuration time.Duration) grpcOpts.UnaryClientInterceptor {
+func retryInterceptor(maxDuration, basePause time.Duration) grpcOpts.UnaryClientInterceptor {
   return func(ctx context.Context, method string, req, reply any, cc *grpcOpts.ClientConn, invoker grpcOpts.UnaryInvoker, opts ...grpcOpts.CallOption) error {
     start := time.Now()
     attempts := 0
     for {
       err := invoker(ctx, method, req, reply, cc, opts...)
       if err == nil {
         return nil
       }
+      // Only retry on transient codes.
+      st, _ := status.FromError(err)
+      switch st.Code() {
+      case codes.Unavailable, codes.DeadlineExceeded, codes.ResourceExhausted, codes.Aborted, codes.Internal:
+        // retry
+      default:
+        return err
+      }
       attempts++
       duration := time.Since(start)
       if duration >= maxDuration {
         return fmt.Errorf("request failed (attempts: %d, duration: %v): %w", attempts, duration, err)
       }
-      select {
+      // Exponential backoff with jitter (cap at 2s).
+      pause := time.Duration(math.Min(float64(2*time.Second), float64(basePause)*math.Pow(2, float64(attempts-1))))
+      jitter := time.Duration(rand.Int63n(int64(pause / 2)))
+      wait := pause/2 + jitter
+      select {
       case <-ctx.Done():
         return ctx.Err()
-      case <-time.After(pauseDuration):
+      case <-time.After(wait):
       }
     }
   }
 }

Also consider DefaultMaxRetryDelay → 5–10s now that pick_first will fail over to backups. Do you want me to open a small follow-up PR with these changes and benchmarks?

♻️ Duplicate comments (1)
bootstrap/bootstrap.go (1)

488-521: Manual resolver + pick_first wiring looks correct now

Target uses the resolver’s scheme with scheme:///authority, resolver state initialized with ordered endpoints, and WithResolvers(mr) + service config applied. This fixes the earlier issue where the manual resolver wasn’t engaged.

🧹 Nitpick comments (6)
services/ingestion/event_subscriber.go (1)

222-245: Add small backoff/jitter before reconnecting on transient errors

To avoid tight reconnect loops (especially when multiple subscribers run), add a brief randomized delay for the transient cases (DeadlineExceeded/Internal/Unavailable).

-   case codes.DeadlineExceeded, codes.Internal, codes.Unavailable:
-     // these are sometimes returned when the stream is disconnected by a middleware or the server
+   case codes.DeadlineExceeded, codes.Internal, codes.Unavailable:
+     // transient disconnects; pause briefly with jitter to reduce thrash
+     time.Sleep(150*time.Millisecond + time.Duration(rand.Intn(200))*time.Millisecond)
tests/helpers.go (2)

95-119: Prefer readiness check over fixed sleep for emulator start

time.Sleep(1s) is flaky on slow CI. Poll emulator readiness (e.g., SDK Ping or wait for gRPC port) before proceeding.

 go func() {
   srv.Start()
 }()
- time.Sleep(1000 * time.Millisecond)
+ // Wait until the emulator responds to Ping (up to 5s)
+ client, _ := grpc.NewClient("localhost:3569")
+ require.Eventually(nil, func() bool { return client.Ping(context.Background()) == nil }, 5*time.Second, 100*time.Millisecond)

418-427: Close HTTP response body to avoid leaks

Add defer res.Body.Close() after Do(req) to free resources promptly.

 res, err := http.DefaultClient.Do(req)
 if err != nil {
   return nil, err
 }
+defer res.Body.Close()
bootstrap/bootstrap.go (2)

44-51: Naming nit: these control total retry duration and pause, not “delay”

Consider DefaultMaxRetryDuration and DefaultRetryInterval to reflect usage.


507-519: Optional: tune connect/backoff params for faster failover

You can bound dial backoff to speed up switching when the first endpoint flaps.

 grpcOpts.WithGRPCDialOptions(
   grpcOpts.WithDefaultCallOptions(grpcOpts.MaxCallRecvMsgSize(DefaultMaxMessageSize)),
   grpcOpts.WithResolvers(mr),
   grpcOpts.WithDefaultServiceConfig(json),
+  grpcOpts.WithConnectParams(grpcOpts.ConnectParams{
+    MinConnectTimeout: 2 * time.Second,
+    Backoff: backoff.Config{BaseDelay: 200 * time.Millisecond, Multiplier: 1.6, Jitter: 0.2, MaxDelay: 2 * time.Second},
+  }),
   grpcOpts.WithUnaryInterceptor(retryInterceptor(
     DefaultMaxRetryDelay,
     DefaultRetryDelay,
   )),
 )
tests/integration_test.go (1)

617-621: Unused context variable.

The backupCancel function is created but never used; the returned context from WithCancel is discarded. While the cleanup via backupSrv.Stop() is sufficient, this creates an unused variable.

Apply this diff to simplify:

-	_, backupCancel := context.WithCancel(context.Background())
-	defer func() {
-		backupCancel()
-		backupSrv.Stop()
-	}()
+	defer backupSrv.Stop()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 779637a and 7306e79.

📒 Files selected for processing (8)
  • bootstrap/bootstrap.go (3 hunks)
  • cmd/run/cmd.go (3 hunks)
  • config/config.go (1 hunks)
  • services/ingestion/event_subscriber.go (1 hunks)
  • tests/helpers.go (3 hunks)
  • tests/integration_test.go (6 hunks)
  • tests/key_store_release_test.go (1 hunks)
  • tests/tx_batching_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cmd/run/cmd.go
  • config/config.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-03-07T01:35:09.751Z
Learnt from: peterargue
PR: onflow/flow-evm-gateway#772
File: services/requester/keystore/key_store.go:50-62
Timestamp: 2025-03-07T01:35:09.751Z
Learning: In the flow-evm-gateway codebase, panics are acceptable in scenarios where immediate detection of critical bugs is desired during development and testing, particularly for invariant violations that should never occur in a correctly functioning system (e.g., when a key is available but locked in the keystore implementation).

Applied to files:

  • services/ingestion/event_subscriber.go
🧬 Code graph analysis (1)
tests/integration_test.go (3)
bootstrap/create-multi-key-account.go (1)
  • CreateMultiKeyAccount (70-194)
config/config.go (2)
  • Config (43-127)
  • TxStateValidation (36-36)
bootstrap/bootstrap.go (2)
  • New (83-111)
  • Run (741-757)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test
🔇 Additional comments (9)
tests/tx_batching_test.go (1)

518-518: LGTM: explicit emulator server config

Switching to startEmulator(true, defaultServerConfig()) aligns with the new helper signature and improves clarity.

services/ingestion/event_subscriber.go (1)

227-233: Good: treat Unavailable as transient

Including codes.Unavailable in the reconnect path prevents crashes during AN restarts/disconnects. Matches the observed failure mode.

tests/helpers.go (1)

70-94: Panic on config init is acceptable in tests

Using panic for invariant failures during test server config/setup is fine here. Based on learnings.

tests/key_store_release_test.go (1)

23-23: LGTM: adopt new emulator helper signature

Using startEmulator(true, defaultServerConfig()) keeps tests consistent with the new helper.

bootstrap/bootstrap.go (1)

24-26: grpc-go version compatibility confirmed
go.mod declares google.golang.org/grpc v1.75.0, which is ≥ v1.58 and includes resolver.State.Endpoints.

tests/integration_test.go (4)

33-33: LGTM: Centralized server configuration.

Consistent adoption of defaultServerConfig() across existing tests improves maintainability by centralizing emulator configuration logic.

Also applies to: 144-144, 249-249, 328-328, 452-452


596-607: LGTM: Proper test setup and teardown.

Standard test initialization with context cancellation and server cleanup properly deferred.


623-642: LGTM: Standard account creation pattern.

Multi-key account creation follows the established pattern from other integration tests.


644-677: LGTM: Proper gateway bootstrap with backup ANs.

Configuration correctly includes AccessNodeBackupHosts and follows the standard bootstrap pattern with proper lifecycle management.


mr := manual.NewBuilderWithScheme("dns")
defer mr.Close()

json := `{"loadBalancingConfig": [{"pick_first":{}}]}`
Contributor

Can you document what the expected behavior is? I'm assuming that pick_first means it will always use the first one unless it fails, then it will go to the next. It would be good to make that clear here.

Collaborator Author

Good point. I've added comments in 117deb1, with a link to the official example: https://github.com/grpc/grpc-go/tree/master/examples/features/load_balancing#pick_first
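
For readers who don't follow the link, the documented behavior amounts to roughly the following (a paraphrase of the grpc-go pick_first example, not the actual comment text from 117deb1):

// pick_first connects to the resolver's endpoints in the order they are
// listed (primary AN first, then the backup ANs) and sends every RPC to the
// first reachable endpoint. Only when that endpoint becomes unreachable does
// it move on to the next one, so this behaves as failover rather than
// spreading requests across backends.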

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 117deb1 and 667681e.

📒 Files selected for processing (1)
  • tests/integration_test.go (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/integration_test.go (3)
bootstrap/create-multi-key-account.go (1)
  • CreateMultiKeyAccount (70-194)
config/config.go (2)
  • Config (43-127)
  • TxStateValidation (36-36)
bootstrap/bootstrap.go (2)
  • New (83-111)
  • Run (746-762)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test
🔇 Additional comments (5)
tests/integration_test.go (5)

33-33: LGTM! Consistent test helper refactoring.

The signature updates to startEmulator are consistent across all test functions and align with the refactoring to accept explicit server configurations.

Also applies to: 144-144, 249-249, 328-328, 452-452


644-662: Gateway configuration correctly sets up primary and backup ANs.

The configuration properly specifies both AccessNodeHost (primary at port 3569) and AccessNodeBackupHosts (backup at port 3599), enabling the failover behavior to be tested.


686-695: Primary AN validation and shutdown logic is sound.

The test correctly:

  1. Validates that eth_syncing works via the primary AN (lines 686-690)
  2. Explicitly stops the primary emulator to simulate AN unavailability (lines 694-695)

This sets up the conditions for testing backup failover.


697-711: Excellent use of retry loop to handle failover timing.

The assert.Eventually with a 5-second timeout and 500ms retry interval properly addresses the previous review concern about gRPC resolver/load-balancer needing time to detect the primary's unavailability and failover to the backup. This approach eliminates the potential flakiness that would result from an immediate single call after shutdown.


595-711: Well-structured test for backup AN functionality.

The test effectively validates the backup AccessNode failover mechanism:

  • Starts independent primary and backup emulators
  • Configures the gateway with both ANs
  • Validates normal operation via primary
  • Simulates primary failure
  • Confirms backup takes over with appropriate retry logic

Note: The test is appropriately scoped to read-only operations (eth_syncing). The multi-key account is created only on the primary emulator, which is sufficient for this test. If future tests need to validate transaction submission after failover, they would need to ensure account state exists on backup ANs as well.
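
The failover assertion described here boils down to a pattern like this (a sketch inside the test function, assuming testify's assert package and a go-ethereum ethclient; the client and backupCtx names follow the surrounding discussion, and the success condition is illustrative):

// keep polling until the gateway serves the request via the backup AN,
// or the 5-second window elapses
assert.Eventually(t, func() bool {
	_, err := client.SyncProgress(backupCtx)
	return err == nil
}, 5*time.Second, 500*time.Millisecond)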

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 199be73 and f04957f.

📒 Files selected for processing (1)
  • services/ingestion/event_subscriber.go (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-03-07T01:35:09.751Z
Learnt from: peterargue
PR: onflow/flow-evm-gateway#772
File: services/requester/keystore/key_store.go:50-62
Timestamp: 2025-03-07T01:35:09.751Z
Learning: In the flow-evm-gateway codebase, panics are acceptable in scenarios where immediate detection of critical bugs is desired during development and testing, particularly for invariant violations that should never occur in a correctly functioning system (e.g., when a key is available but locked in the keystore implementation).

Applied to files:

  • services/ingestion/event_subscriber.go
🧬 Code graph analysis (1)
services/ingestion/event_subscriber.go (2)
models/events.go (1)
  • NewBlockEventsError (252-256)
models/errors/errors.go (1)
  • ErrDisconnected (24-24)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test
🔇 Additional comments (1)
services/ingestion/event_subscriber.go (1)

227-227: LGTM! Addition of codes.Unavailable enables proper failover to backup ANs.

This change correctly handles the Unavailable error code, allowing the reconnection logic to trigger the pick_first load balancer to fail over to backup Access Nodes when the primary becomes unavailable.

Based on past review comments and PR objectives.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f04957f and f30b0b3.

📒 Files selected for processing (1)
  • bootstrap/bootstrap.go (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test
🔇 Additional comments (4)
bootstrap/bootstrap.go (4)

25-26: LGTM! Imports align with manual resolver implementation.

The gRPC resolver imports are correctly added to support the new manual resolver-based dialing for backup Access Nodes.


46-52: LGTM! Constant renaming reflects broader retry scope.

The constants are appropriately renamed from DefaultResourceExhaustedRetryDelay/DefaultResourceExhaustedMaxRetryDelay to DefaultRetryDelay/DefaultMaxRetryDelay, aligning with the expanded retry logic in retryInterceptor that now handles multiple error types beyond just ResourceExhausted.


490-538: Backup Access Node implementation is well-structured (pending resolver lifecycle fix).

The conditional logic correctly distinguishes between configurations with and without backup hosts:

  • Backup path: Uses manual resolver with pick_first load balancing to enable failover.
  • Single-host path: Retains existing direct dial behavior.

The endpoint construction (lines 500-508) correctly places the primary host first, followed by backup hosts, ensuring proper failover order. The target URI format "dns:///flow-access" is correct and engages the manual resolver as intended.

However, the critical issue with defer mr.Close() on line 492 must be addressed before this implementation can function correctly.


592-603: Verify that Unavailable errors should trigger immediate failover without retry.

The updated retry logic treats different error codes distinctly:

  • Canceled/DeadlineExceeded: Non-retriable (correct—guaranteed to fail on local context).
  • ResourceExhausted/OutOfRange/NotFound: Retried with a brief pause (reasonable—gives the AN a chance to recover).
  • Other errors (including Unavailable): Return immediately without retrying on the same backend.

For Unavailable errors (common when an AN is down or unreachable), the current implementation skips retrying on the same backend and allows the pick_first load balancer to immediately switch to a backup AN. This seems intentional for fast failover.

However, based on past review discussions about whether to retry all errors or only specific ones, please confirm this is the intended behavior. Specifically:

  1. Should Unavailable errors be retried with a pause on the same backend first, or immediately handed off to the load balancer for failover?
  2. Are there any other error codes (e.g., Internal, Unknown) that should be explicitly handled?

If the current behavior is correct, consider adding a comment explaining that Unavailable errors trigger immediate failover for resilience:

 		switch status.Code(err) {
 		case codes.Canceled, codes.DeadlineExceeded:
 			// these kind of errors are guaranteed to fail all requests,
 			// if the source was a local context
 			return err
 		case codes.ResourceExhausted, codes.OutOfRange, codes.NotFound:
 			// when we receive these errors, we pause briefly, so that
 			// the next request on the same AN, has a higher chance
 			// of success.
 		default:
+			// For connection errors like Unavailable, return immediately
+			// to allow the load balancer to fail over to a backup AN.
 			return err
 		}

@m-Peter m-Peter force-pushed the mpeter/access-nodes-network-resilience branch from f30b0b3 to 422d66e October 9, 2025 13:32
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
bootstrap/bootstrap.go (1)

482-518: Keep the manual resolver alive for the client lifetime

Line 484: defer mr.Close() runs as soon as this function returns, tearing down the resolver the client depends on for failover. Remove the defer and close the resolver when the client is shut down instead (e.g., store mr and call Close() from StopClient).

-		mr := manual.NewBuilderWithScheme("dns")
-		defer mr.Close()
+		mr := manual.NewBuilderWithScheme("dns")
+		// Keep this resolver alive until the client is closed.
🧹 Nitpick comments (1)
cmd/run/cmd.go (1)

204-206: Trim whitespace when parsing backup hosts

Line 205: CLI values like "host1, host2" will leave a leading space on the second entry, producing an invalid dial target. Trim the segments before assigning to cfg.AccessNodeBackupHosts.

-	if accessNodeBackupHosts != "" {
-		cfg.AccessNodeBackupHosts = strings.Split(accessNodeBackupHosts, ",")
-	}
+	if accessNodeBackupHosts != "" {
+		hosts := strings.Split(accessNodeBackupHosts, ",")
+		for i := range hosts {
+			hosts[i] = strings.TrimSpace(hosts[i])
+		}
+		cfg.AccessNodeBackupHosts = hosts
+	}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f30b0b3 and 422d66e.

📒 Files selected for processing (8)
  • bootstrap/bootstrap.go (4 hunks)
  • cmd/run/cmd.go (3 hunks)
  • config/config.go (1 hunks)
  • services/ingestion/event_subscriber.go (1 hunks)
  • tests/helpers.go (3 hunks)
  • tests/integration_test.go (6 hunks)
  • tests/key_store_release_test.go (1 hunks)
  • tests/tx_batching_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/key_store_release_test.go
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-06T10:14:49.676Z
Learnt from: m-Peter
PR: onflow/flow-evm-gateway#890
File: services/ingestion/event_subscriber.go:235-239
Timestamp: 2025-10-06T10:14:49.676Z
Learning: In services/ingestion/event_subscriber.go, when reconnecting after disconnect errors (DeadlineExceeded, Internal, Unavailable), the subscription should reconnect at lastReceivedHeight rather than lastReceivedHeight+1. This avoids errors when the next height doesn't exist yet, and duplicate event processing is safe because the ingestion engine is explicitly designed to be idempotent (storage uses batch.Set() which overwrites existing entries).

Applied to files:

  • services/ingestion/event_subscriber.go
📚 Learning: 2025-03-07T01:35:09.751Z
Learnt from: peterargue
PR: onflow/flow-evm-gateway#772
File: services/requester/keystore/key_store.go:50-62
Timestamp: 2025-03-07T01:35:09.751Z
Learning: In the flow-evm-gateway codebase, panics are acceptable in scenarios where immediate detection of critical bugs is desired during development and testing, particularly for invariant violations that should never occur in a correctly functioning system (e.g., when a key is available but locked in the keystore implementation).

Applied to files:

  • services/ingestion/event_subscriber.go
🧬 Code graph analysis (2)
services/ingestion/event_subscriber.go (2)
models/events.go (1)
  • NewBlockEventsError (252-256)
models/errors/errors.go (1)
  • ErrDisconnected (24-24)
tests/integration_test.go (3)
bootstrap/create-multi-key-account.go (1)
  • CreateMultiKeyAccount (70-194)
config/config.go (3)
  • Config (43-127)
  • TxStateValidation (36-36)
  • LocalIndexValidation (39-39)
bootstrap/bootstrap.go (2)
  • New (85-113)
  • Run (753-769)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test
🔇 Additional comments (7)
tests/helpers.go (2)

70-93: LGTM! Clean refactoring for test configuration reusability.

Extracting the server configuration into defaultServerConfig() improves maintainability and allows tests to provide custom configurations when needed. Using panic for error handling is acceptable in test helpers, though it differs from typical Go patterns.


95-119: LGTM! Signature change enables configurable test emulators.

The expanded signature allows tests to provide custom server configurations, supporting scenarios like the new backup AccessNode test. The implementation correctly uses the provided configuration.

tests/integration_test.go (5)

33-33: LGTM! Consistent updates to match new startEmulator signature.

All test functions correctly updated to pass defaultServerConfig() to startEmulator, maintaining consistency across the test suite.

Also applies to: 144-144, 249-249, 328-328, 452-452


596-622: LGTM! Well-structured backup AccessNode test setup.

The test correctly:

  • Sets up primary and backup emulators with non-conflicting ports
  • Maintains separate contexts for independent lifecycle management
  • Addresses past review feedback on context handling

The backup emulator configuration with distinct GRPC/REST/Admin/Debugger ports ensures proper isolation during failover testing.


644-662: LGTM! Gateway configuration correctly includes backup hosts.

The configuration properly wires in the AccessNodeBackupHosts field, demonstrating the new failover functionality introduced in this PR. The backup host correctly references the alternate emulator port.


664-690: LGTM! Proper gateway initialization and primary validation.

The test correctly:

  • Initializes the gateway with backup host configuration
  • Ensures proper cleanup with deferred boot.Stop()
  • Validates the primary AccessNode is serving requests before testing failover

692-711: LGTM! Robust failover validation with proper retry logic.

The test correctly:

  • Shuts down the primary AccessNode to trigger failover
  • Uses assert.Eventually with retry logic to account for gRPC resolver/load balancer detection time (addresses past review feedback)
  • Uses backupCtx for validation since the primary context was cancelled
  • Employs reasonable timeout (5s) and polling interval (500ms) for failover detection

This validates that the backup AccessNode successfully takes over when the primary becomes unavailable, fulfilling the PR's core objective.

@m-Peter m-Peter force-pushed the mpeter/access-nodes-network-resilience branch from 422d66e to ce4808f October 9, 2025 13:47
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
services/ingestion/event_subscriber.go (1)

235-258: LGTM! Retry loop correctly implements 30-second reconnection window.

The retry logic properly handles reconnection attempts with appropriate timeout and pause intervals. Reconnecting at lastReceivedHeight (not +1) is correct because the ingestion engine is explicitly idempotent.

Based on learnings.

Optional: Remove unused attempts variable.

The attempts variable (line 236) is incremented but never used. If you plan to log or expose retry metrics, retain it; otherwise, consider removing it to reduce clutter.

Apply this diff if the variable is not needed:

-			attempts := 0
 			pauseDuration, maxDuration := 200*time.Millisecond, 30*time.Second
 			// Allow reconnect retries for up to 30 seconds, with retry
 			// attempts every 200 ms.
 			for {
 				err := connect(lastReceivedHeight)
 				if err == nil {
 					break
 				}

-				attempts++
 				duration := time.Since(start)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 422d66e and ce4808f.

📒 Files selected for processing (1)
  • services/ingestion/event_subscriber.go (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-06T10:14:49.676Z
Learnt from: m-Peter
PR: onflow/flow-evm-gateway#890
File: services/ingestion/event_subscriber.go:235-239
Timestamp: 2025-10-06T10:14:49.676Z
Learning: In services/ingestion/event_subscriber.go, when reconnecting after disconnect errors (DeadlineExceeded, Internal, Unavailable), the subscription should reconnect at lastReceivedHeight rather than lastReceivedHeight+1. This avoids errors when the next height doesn't exist yet, and duplicate event processing is safe because the ingestion engine is explicitly designed to be idempotent (storage uses batch.Set() which overwrites existing entries).

Applied to files:

  • services/ingestion/event_subscriber.go
📚 Learning: 2025-03-07T01:35:09.751Z
Learnt from: peterargue
PR: onflow/flow-evm-gateway#772
File: services/requester/keystore/key_store.go:50-62
Timestamp: 2025-03-07T01:35:09.751Z
Learning: In the flow-evm-gateway codebase, panics are acceptable in scenarios where immediate detection of critical bugs is desired during development and testing, particularly for invariant violations that should never occur in a correctly functioning system (e.g., when a key is available but locked in the keystore implementation).

Applied to files:

  • services/ingestion/event_subscriber.go
🧬 Code graph analysis (1)
services/ingestion/event_subscriber.go (2)
models/events.go (1)
  • NewBlockEventsError (252-256)
models/errors/errors.go (1)
  • ErrDisconnected (24-24)
🔇 Additional comments (1)
services/ingestion/event_subscriber.go (1)

227-227: LGTM! Enables failover to backup Access Nodes.

Adding codes.Unavailable to the reconnection trigger is correct. When the primary AN becomes unavailable, this ensures the retry loop engages and the pick_first load balancer can select the next available backup host, preventing fatal errors.

Based on learnings and past review discussions.

Successfully merging this pull request may close these issues.

Improve resilience on connections with upstream ANs

3 participants