- 
                Notifications
    You must be signed in to change notification settings 
- Fork 12
Allow setting back-up ANs for networking resilience #890
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| WalkthroughAdds support for dialing Access Nodes via a manual gRPC resolver (primary + backups with pick_first LB), updates retry/interceptor constants and behavior, makes subscriber reconnects retry with a 30s window, adds CLI/config for backup hosts, and refactors tests to accept an explicit server config plus a new integration test for AN failover. Changes
 Sequence Diagram(s)sequenceDiagram
  autonumber
  participant App as Gateway Bootstrap
  participant CFG as Config
  participant R as Manual Resolver
  participant GRPC as gRPC Client
  participant AN as Access Nodes
  Note over CFG: AccessNodeHost + optional AccessNodeBackupHosts
  App->>CFG: Read AccessNodeHost, BackupHosts
  alt BackupHosts provided
    App->>R: Build endpoints list (primary + backups)
    R-->>GRPC: Provide resolver InitialState (endpoints)
    App->>GRPC: Dial with resolver + service config (pick_first) + retry interceptor
    GRPC->>AN: Connect via LB across endpoints
  else No backups
    App->>GRPC: Dial single host (legacy path)
    GRPC->>AN: Connect to primary only
  end
sequenceDiagram
  autonumber
  participant Sub as RPCEventSubscriber
  participant AN as Access Node
  participant Timer as Timer/Backoff
  Sub->>AN: Subscribe / Stream
  AN--xSub: Error (DeadlineExceeded / Internal / Unavailable / NotFound)
  alt NotFound
    Sub->>Timer: Short wait
    Timer-->>Sub: Retry subscribe
  else Transient error
    loop Retry loop (up to 30s)
      Sub->>Timer: Wait 200ms
      Timer-->>Sub: Retry connect(lastReceivedHeight)
      Sub->>AN: Attempt reconnect/resubscribe
      AN-->>Sub: Success or continue loop
    end
    alt Timeout reached
      Sub->>Sub: Return BlockEventsError(lastHeight, lastErr)
    end
  end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
 Suggested reviewers
 Poem
 Pre-merge checks and finishing touches❌ Failed checks (1 warning)
 ✅ Passed checks (4 passed)
 ✨ Finishing touches
 🧪 Generate unit tests (beta)
 Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment  | 
2baf5d5    to
    b9c76eb      
    Compare
  
    87552da    to
    2883e0e      
    Compare
  
    | // next block is finalized. just wait briefly and try again | ||
| time.Sleep(200 * time.Millisecond) | ||
| case codes.DeadlineExceeded, codes.Internal: | ||
| case codes.DeadlineExceeded, codes.Internal, codes.Unavailable: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During some local testing with having 2 Flow Emulator processes, when I killed the 1st process, which was configured as the main AN(AccessNodeHost), then the EVM GW crashed with:
failure in event subscription with: recoverable: disconnected:
error receiving event: rpc error:
code = Unavailable desc = error reading from server: EOFAdding the codes.Unavailable case, solved this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Unavailable is different since it is unlikely that the node will suddenly be available on reconnect. I think if we receive Unavailable, it should failover to the new node. we could let it retry for some period first, but at the expense of delaying data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not really sure how SubscribeEventsByBlockHeight is implemented under the hood, but I've observed that it doesn't go through the retryInterceptor, unlike other AN calls. For example:
[METHOD]:  /flow.access.AccessAPI/GetLatestBlock
[METHOD]:  /flow.access.AccessAPI/GetAccountAtLatestBlock
[METHOD]:  /flow.access.AccessAPI/SendTransaction
[METHOD]:  /flow.access.AccessAPI/ExecuteScriptAtLatestBlockI have verified though, that we do need codes.Unavailable in the switch case above, so that it triggers a reconnect with:
if err := connect(lastReceivedHeight); err != nil {
	eventsChan <- models.NewBlockEventsError(
		fmt.Errorf(
			"failed to resubscribe for events on height: %d, with: %w",
			lastReceivedHeight,
			err,
		),
	)
	return
}Note that if we trigger a reconnect, this will prompt the pick_first load balancer to search for the next node which can serve the given request, even if it is unlikely that the current node will suddenly be available.
This will save the EVM Gateway from a fatal error, if the configured backup ANs are indeed available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
cmd/run/cmd.go (1)
204-206: Trim backup host inputs to avoid whitespace-induced dial failures.Users often pass comma-separated lists with spaces (
"host-1.com, host-2.com"). Without trimming, we end up dialing" host-2.com", which gRPC will reject. Filtering out empty/whitespace-only entries at parse time makes the new flag much harder to misconfigure.if accessNodeBackupHosts != "" { - cfg.AccessNodeBackupHosts = strings.Split(accessNodeBackupHosts, ",") + var hosts []string + for _, host := range strings.Split(accessNodeBackupHosts, ",") { + host = strings.TrimSpace(host) + if host == "" { + continue + } + hosts = append(hosts, host) + } + cfg.AccessNodeBackupHosts = hosts }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- bootstrap/bootstrap.go(3 hunks)
- cmd/run/cmd.go(3 hunks)
- config/config.go(1 hunks)
- services/ingestion/event_subscriber.go(1 hunks)
- tests/helpers.go(3 hunks)
- tests/integration_test.go(6 hunks)
- tests/key_store_release_test.go(1 hunks)
- tests/tx_batching_test.go(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/integration_test.go (3)
bootstrap/create-multi-key-account.go (1)
CreateMultiKeyAccount(70-194)config/config.go (2)
Config(43-127)
TxStateValidation(36-36)bootstrap/bootstrap.go (2)
New(83-111)
Run(740-756)
| DefaultResourceExhaustedMaxRetryDelay = 30 * time.Second | ||
| // DefaultMaxRetryDelay is the default max request duration when retrying failed | ||
| // gRPC requests to one of the Access Nodes. | ||
| DefaultMaxRetryDelay = 30 * time.Second | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we are adding load-balancing functionality, should we maybe decrease the max retry duration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this load balancing or failovers? I'm not aware of a demand for load balancing requests to different backends, but there is clear need for failing over when the primary is unavailable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we use pick_first as the load balancing strategy, effectively this works as a failover mechanism. Because it will stick to the same backend, until that backend is unable to serve any requests (due to connectivity issues), in which case it will pick the next available backend.
I only changed the name of the constant, from DefaultResourceExhaustedMaxRetryDelay to DefaultMaxRetryDelay, because previously the retryInterceptor would only retry on ResourceExhausted errors.
But I've updated that condition in f30b0b3, to account for more related errors that we can retry on the same AN.
| return nil | ||
| } | ||
|  | ||
| if status.Code(err) != codes.ResourceExhausted { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we added the load balancing config:
`{"loadBalancingConfig": [{"pick_first":{}}]}`I removed this error code check entirely.
The reason being, if we receive any kind of gRPC error from one of the ANs:
- The request will be retried for the max specified duration on the same AN
- Then the configured pick_firstload-balancing strategy, will try the next ANs, until it finds one that responds without an error
However, I just noticed that the retryInterceptor is used even when there isn't any configured back-up ANs.
Should we just change the DefaultMaxRetryDelay instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way to customize the behavior?
for instance, if the backend returned ResourceExhausted, retrying immediately will only make it worse and the node will eventually have to fail over. vs pausing briefly may allow the next request to succeed.
Similarly, if the error is OutOfRange or NotFound, retrying immediately is not likely to succeed.
Canceled or DeadlineExceeded are guaranteed to fail all requests if the source was a local context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason being, if we receive any kind of gRPC error from one of the ANs:
- The request will be retried for the max specified duration on the same AN
- Then the configured
pick_firstload-balancing strategy, will try the next ANs, until it finds one that responds without an error
Sorry about that, but I had a misconception about how these 2 relate to each other.
The pick_first load-balancing strategy only checks the node's connectivity state ( Ready / Connecting / Idle / TransientFailure etc ), and that's the only signal it uses in order to connect to the next node.
The errors that might be returned from specific node API calls do not really affect the node selection in any way, thus we can't direct the load-balancer when to connect to a different node inside the retryInterceptor.
So the retryInterceptor and the pick_first load-balancing strategy don't get in each other's feet.
I have added some error handling in retryInterceptor in f30b0b3 , to include the errors you mentioned above.
Is that what you had in mind?
779637a    to
    7306e79      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️  Outside diff range comments (6)
tests/tx_batching_test.go (5)
42-56: Fix:rangeover int does not compile
for i := range totalTxsis invalid;rangecannot iterate an int. Use an index loop or iterate the slice you allocated.- for i := range totalTxs { + for i := 0; i < totalTxs; i++ { signed, _, err := evmSign(big.NewInt(10), 21000, eoaKey, nonce, &testAddr, nil) … nonce += 1 }
126-151: Fix: invalid loop over int in goroutine blockSame issue; replace
for range totalTxswith an index loop.- for range totalTxs { + for i := 0; i < totalTxs; i++ { signed, _, err := evmSign( big.NewInt(transferAmount), 23_500, privateKey, nonce, &testEoaReceiver, nil, ) … nonce += 1 }
227-242: Fix: invalidrangeover literaluint64(2)Use a counted loop.
- for i := range uint64(2) { + for i := uint64(0); i < 2; i++ { signed, _, err := evmSign( big.NewInt(500_000), 23_500, privatekey, i, &testEoaReceiver, nil, )
336-356: Fix: invalidrangeover literaluint64(2)Same pattern; replace with counted loop.
- for i := range uint64(2) { + for i := uint64(0); i < 2; i++ {
450-471: Fix: invalidrangeover literaluint64(2)Same correction as above.
- for i := range uint64(2) { + for i := uint64(0); i < 2; i++ {bootstrap/bootstrap.go (1)
566-598: Scope retries to transient errors and add exponential backoff with jitterRetrying on all errors for up to 30s can mask permanent failures (e.g., InvalidArgument, NotFound), inflate latency, and increase load. Limit to transient statuses and add backoff+jitter. Also consider reducing max duration now that failover is handled by LB.
-func retryInterceptor(maxDuration, pauseDuration time.Duration) grpcOpts.UnaryClientInterceptor { +func retryInterceptor(maxDuration, basePause time.Duration) grpcOpts.UnaryClientInterceptor { return func(ctx context.Context, method string, req, reply any, cc *grpcOpts.ClientConn, invoker grpcOpts.UnaryInvoker, opts ...grpcOpts.CallOption) error { start := time.Now() attempts := 0 for { err := invoker(ctx, method, req, reply, cc, opts...) if err == nil { return nil } + // Only retry on transient codes. + st, _ := status.FromError(err) + switch st.Code() { + case codes.Unavailable, codes.DeadlineExceeded, codes.ResourceExhausted, codes.Aborted, codes.Internal: + // retry + default: + return err + } attempts++ duration := time.Since(start) if duration >= maxDuration { return fmt.Errorf("request failed (attempts: %d, duration: %v): %w", attempts, duration, err) } - select { + // Exponential backoff with jitter (cap at 2s). + pause := time.Duration(math.Min(float64(2*time.Second), float64(basePause)*math.Pow(2, float64(attempts-1)))) + jitter := time.Duration(rand.Int63n(int64(pause / 2))) + wait := pause/2 + jitter + select { case <-ctx.Done(): return ctx.Err() - case <-time.After(pauseDuration): + case <-time.After(wait): } } } }Also consider
DefaultMaxRetryDelay→ 5–10s now thatpick_firstwill fail over to backups. Do you want me to open a small follow-up PR with these changes and benchmarks?
♻️ Duplicate comments (1)
bootstrap/bootstrap.go (1)
488-521: Manual resolver + pick_first wiring looks correct nowTarget uses the resolver’s scheme with
scheme:///authority, resolver state initialized with ordered endpoints, andWithResolvers(mr)+ service config applied. This fixes the earlier issue where the manual resolver wasn’t engaged.
🧹 Nitpick comments (6)
services/ingestion/event_subscriber.go (1)
222-245: Add small backoff/jitter before reconnecting on transient errorsTo avoid tight reconnect loops (especially when multiple subscribers run), add a brief randomized delay for the transient cases (DeadlineExceeded/Internal/Unavailable).
- case codes.DeadlineExceeded, codes.Internal, codes.Unavailable: - // these are sometimes returned when the stream is disconnected by a middleware or the server + case codes.DeadlineExceeded, codes.Internal, codes.Unavailable: + // transient disconnects; pause briefly with jitter to reduce thrash + time.Sleep(150*time.Millisecond + time.Duration(rand.Intn(200))*time.Millisecond)tests/helpers.go (2)
95-119: Prefer readiness check over fixed sleep for emulator start
time.Sleep(1s)is flaky on slow CI. Poll emulator readiness (e.g., SDKPingor wait for gRPC port) before proceeding.go func() { srv.Start() }() - time.Sleep(1000 * time.Millisecond) + // Wait until the emulator responds to Ping (up to 5s) + client, _ := grpc.NewClient("localhost:3569") + require.Eventually(nil, func() bool { return client.Ping(context.Background()) == nil }, 5*time.Second, 100*time.Millisecond)
418-427: Close HTTP response body to avoid leaksAdd
defer res.Body.Close()afterDo(req)to free resources promptly.res, err := http.DefaultClient.Do(req) if err != nil { return nil, err } +defer res.Body.Close()bootstrap/bootstrap.go (2)
44-51: Naming nit: these control total retry duration and pause, not “delay”Consider
DefaultMaxRetryDurationandDefaultRetryIntervalto reflect usage.
507-519: Optional: tune connect/backoff params for faster failoverYou can bound dial backoff to speed up switching when the first endpoint flaps.
grpcOpts.WithGRPCDialOptions( grpcOpts.WithDefaultCallOptions(grpcOpts.MaxCallRecvMsgSize(DefaultMaxMessageSize)), grpcOpts.WithResolvers(mr), grpcOpts.WithDefaultServiceConfig(json), + grpcOpts.WithConnectParams(grpcOpts.ConnectParams{ + MinConnectTimeout: 2 * time.Second, + Backoff: backoff.Config{BaseDelay: 200 * time.Millisecond, Multiplier: 1.6, Jitter: 0.2, MaxDelay: 2 * time.Second}, + }), grpcOpts.WithUnaryInterceptor(retryInterceptor( DefaultMaxRetryDelay, DefaultRetryDelay, )), )tests/integration_test.go (1)
617-621: Unused context variable.The
backupCancelfunction is created but never used; the returned context fromWithCancelis discarded. While the cleanup viabackupSrv.Stop()is sufficient, this creates an unused variable.Apply this diff to simplify:
- _, backupCancel := context.WithCancel(context.Background()) - defer func() { - backupCancel() - backupSrv.Stop() - }() + defer backupSrv.Stop()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- bootstrap/bootstrap.go(3 hunks)
- cmd/run/cmd.go(3 hunks)
- config/config.go(1 hunks)
- services/ingestion/event_subscriber.go(1 hunks)
- tests/helpers.go(3 hunks)
- tests/integration_test.go(6 hunks)
- tests/key_store_release_test.go(1 hunks)
- tests/tx_batching_test.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- cmd/run/cmd.go
- config/config.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-03-07T01:35:09.751Z
Learnt from: peterargue
PR: onflow/flow-evm-gateway#772
File: services/requester/keystore/key_store.go:50-62
Timestamp: 2025-03-07T01:35:09.751Z
Learning: In the flow-evm-gateway codebase, panics are acceptable in scenarios where immediate detection of critical bugs is desired during development and testing, particularly for invariant violations that should never occur in a correctly functioning system (e.g., when a key is available but locked in the keystore implementation).
Applied to files:
- services/ingestion/event_subscriber.go
🧬 Code graph analysis (1)
tests/integration_test.go (3)
bootstrap/create-multi-key-account.go (1)
CreateMultiKeyAccount(70-194)config/config.go (2)
Config(43-127)
TxStateValidation(36-36)bootstrap/bootstrap.go (2)
New(83-111)
Run(741-757)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Test
🔇 Additional comments (9)
tests/tx_batching_test.go (1)
518-518: LGTM: explicit emulator server configSwitching to
startEmulator(true, defaultServerConfig())aligns with the new helper signature and improves clarity.services/ingestion/event_subscriber.go (1)
227-233: Good: treat Unavailable as transientIncluding
codes.Unavailablein the reconnect path prevents crashes during AN restarts/disconnects. Matches the observed failure mode.tests/helpers.go (1)
70-94: Panic on config init is acceptable in testsUsing
panicfor invariant failures during test server config/setup is fine here. Based on learnings.tests/key_store_release_test.go (1)
23-23: LGTM: adopt new emulator helper signatureUsing
startEmulator(true, defaultServerConfig())keeps tests consistent with the new helper.bootstrap/bootstrap.go (1)
24-26: grpc-go version compatibility confirmed
go.mod declaresgoogle.golang.org/grpc v1.75.0, which is ≥ v1.58 and includesresolver.State.Endpoints.tests/integration_test.go (4)
33-33: LGTM: Centralized server configuration.Consistent adoption of
defaultServerConfig()across existing tests improves maintainability by centralizing emulator configuration logic.Also applies to: 144-144, 249-249, 328-328, 452-452
596-607: LGTM: Proper test setup and teardown.Standard test initialization with context cancellation and server cleanup properly deferred.
623-642: LGTM: Standard account creation pattern.Multi-key account creation follows the established pattern from other integration tests.
644-677: LGTM: Proper gateway bootstrap with backup ANs.Configuration correctly includes
AccessNodeBackupHostsand follows the standard bootstrap pattern with proper lifecycle management.
| // next block is finalized. just wait briefly and try again | ||
| time.Sleep(200 * time.Millisecond) | ||
| case codes.DeadlineExceeded, codes.Internal: | ||
| case codes.DeadlineExceeded, codes.Internal, codes.Unavailable: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Unavailable is different since it is unlikely that the node will suddenly be available on reconnect. I think if we receive Unavailable, it should failover to the new node. we could let it retry for some period first, but at the expense of delaying data.
| mr := manual.NewBuilderWithScheme("dns") | ||
| defer mr.Close() | ||
|  | ||
| json := `{"loadBalancingConfig": [{"pick_first":{}}]}` | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you document what the expected behavior is? I'm assuming that pick_first means it will always use the first one unless it fails, then it will go to the next. it would be good to make it clear here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I've added comments in 117deb1, with a link to the official example: https://github.com/grpc/grpc-go/tree/master/examples/features/load_balancing#pick_first .
| return nil | ||
| } | ||
|  | ||
| if status.Code(err) != codes.ResourceExhausted { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way to customize the behavior?
for instance, if the backend returned ResourceExhausted, retrying immediately will only make it worse and the node will eventually have to fail over. vs pausing briefly may allow the next request to succeed.
Similarly, if the error is OutOfRange or NotFound, retrying immediately is not likely to succeed.
Canceled or DeadlineExceeded are guaranteed to fail all requests if the source was a local context.
| DefaultResourceExhaustedMaxRetryDelay = 30 * time.Second | ||
| // DefaultMaxRetryDelay is the default max request duration when retrying failed | ||
| // gRPC requests to one of the Access Nodes. | ||
| DefaultMaxRetryDelay = 30 * time.Second | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this load balancing or failovers? I'm not aware of a demand for load balancing requests to different backends, but there is clear need for failing over when the primary is unavailable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tests/integration_test.go(6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/integration_test.go (3)
bootstrap/create-multi-key-account.go (1)
CreateMultiKeyAccount(70-194)config/config.go (2)
Config(43-127)
TxStateValidation(36-36)bootstrap/bootstrap.go (2)
New(83-111)
Run(746-762)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Test
🔇 Additional comments (5)
tests/integration_test.go (5)
33-33: LGTM! Consistent test helper refactoring.The signature updates to
startEmulatorare consistent across all test functions and align with the refactoring to accept explicit server configurations.Also applies to: 144-144, 249-249, 328-328, 452-452
644-662: Gateway configuration correctly sets up primary and backup ANs.The configuration properly specifies both
AccessNodeHost(primary at port 3569) andAccessNodeBackupHosts(backup at port 3599), enabling the failover behavior to be tested.
686-695: Primary AN validation and shutdown logic is sound.The test correctly:
- Validates that
eth_syncingworks via the primary AN (lines 686-690)- Explicitly stops the primary emulator to simulate AN unavailability (lines 694-695)
This sets up the conditions for testing backup failover.
697-711: Excellent use of retry loop to handle failover timing.The
assert.Eventuallywith a 5-second timeout and 500ms retry interval properly addresses the previous review concern about gRPC resolver/load-balancer needing time to detect the primary's unavailability and failover to the backup. This approach eliminates the potential flakiness that would result from an immediate single call after shutdown.
595-711: Well-structured test for backup AN functionality.The test effectively validates the backup AccessNode failover mechanism:
- Starts independent primary and backup emulators
- Configures the gateway with both ANs
- Validates normal operation via primary
- Simulates primary failure
- Confirms backup takes over with appropriate retry logic
Note: The test is appropriately scoped to read-only operations (
eth_syncing). The multi-key account is created only on the primary emulator, which is sufficient for this test. If future tests need to validate transaction submission after failover, they would need to ensure account state exists on backup ANs as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- services/ingestion/event_subscriber.go(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-03-07T01:35:09.751Z
Learnt from: peterargue
PR: onflow/flow-evm-gateway#772
File: services/requester/keystore/key_store.go:50-62
Timestamp: 2025-03-07T01:35:09.751Z
Learning: In the flow-evm-gateway codebase, panics are acceptable in scenarios where immediate detection of critical bugs is desired during development and testing, particularly for invariant violations that should never occur in a correctly functioning system (e.g., when a key is available but locked in the keystore implementation).
Applied to files:
- services/ingestion/event_subscriber.go
🧬 Code graph analysis (1)
services/ingestion/event_subscriber.go (2)
models/events.go (1)
NewBlockEventsError(252-256)models/errors/errors.go (1)
ErrDisconnected(24-24)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Test
🔇 Additional comments (1)
services/ingestion/event_subscriber.go (1)
227-227: LGTM! Addition ofcodes.Unavailableenables proper failover to backup ANs.This change correctly handles the
Unavailableerror code, allowing the reconnection logic to trigger thepick_firstload balancer to fail over to backup Access Nodes when the primary becomes unavailable.Based on past review comments and PR objectives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- bootstrap/bootstrap.go(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Test
🔇 Additional comments (4)
bootstrap/bootstrap.go (4)
25-26: LGTM! Imports align with manual resolver implementation.The gRPC resolver imports are correctly added to support the new manual resolver-based dialing for backup Access Nodes.
46-52: LGTM! Constant renaming reflects broader retry scope.The constants are appropriately renamed from
DefaultResourceExhaustedRetryDelay/DefaultResourceExhaustedMaxRetryDelaytoDefaultRetryDelay/DefaultMaxRetryDelay, aligning with the expanded retry logic inretryInterceptorthat now handles multiple error types beyond justResourceExhausted.
490-538: Backup Access Node implementation is well-structured (pending resolver lifecycle fix).The conditional logic correctly distinguishes between configurations with and without backup hosts:
- Backup path: Uses manual resolver with
pick_firstload balancing to enable failover.- Single-host path: Retains existing direct dial behavior.
The endpoint construction (lines 500-508) correctly places the primary host first, followed by backup hosts, ensuring proper failover order. The target URI format
"dns:///flow-access"is correct and engages the manual resolver as intended.However, the critical issue with
defer mr.Close()on line 492 must be addressed before this implementation can function correctly.
592-603: Verify thatUnavailableerrors should trigger immediate failover without retry.The updated retry logic treats different error codes distinctly:
Canceled/DeadlineExceeded: Non-retriable (correct—guaranteed to fail on local context).
ResourceExhausted/OutOfRange/NotFound: Retried with a brief pause (reasonable—gives the AN a chance to recover).- Other errors (including
Unavailable): Return immediately without retrying on the same backend.For
Unavailableerrors (common when an AN is down or unreachable), the current implementation skips retrying on the same backend and allows thepick_firstload balancer to immediately switch to a backup AN. This seems intentional for fast failover.However, based on past review discussions about whether to retry all errors or only specific ones, please confirm this is the intended behavior. Specifically:
- Should
Unavailableerrors be retried with a pause on the same backend first, or immediately handed off to the load balancer for failover?- Are there any other error codes (e.g.,
Internal,Unknown) that should be explicitly handled?If the current behavior is correct, consider adding a comment explaining that
Unavailableerrors trigger immediate failover for resilience:switch status.Code(err) { case codes.Canceled, codes.DeadlineExceeded: // these kind of errors are guaranteed to fail all requests, // if the source was a local context return err case codes.ResourceExhausted, codes.OutOfRange, codes.NotFound: // when we receive these errors, we pause briefly, so that // the next request on the same AN, has a higher chance // of success. default: + // For connection errors like Unavailable, return immediately + // to allow the load balancer to fail over to a backup AN. return err }
f30b0b3    to
    422d66e      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
bootstrap/bootstrap.go (1)
482-518: Keep the manual resolver alive for the client lifetimeLine 484:
defer mr.Close()runs as soon as this function returns, tearing down the resolver the client depends on for failover. Remove the defer and close the resolver when the client is shut down instead (e.g., storemrand callClose()fromStopClient).- mr := manual.NewBuilderWithScheme("dns") - defer mr.Close() + mr := manual.NewBuilderWithScheme("dns") + // Keep this resolver alive until the client is closed.
🧹 Nitpick comments (1)
cmd/run/cmd.go (1)
204-206: Trim whitespace when parsing backup hostsLine 205: CLI values like
"host1, host2"will leave a leading space on the second entry, producing an invalid dial target. Trim the segments before assigning tocfg.AccessNodeBackupHosts.- if accessNodeBackupHosts != "" { - cfg.AccessNodeBackupHosts = strings.Split(accessNodeBackupHosts, ",") - } + if accessNodeBackupHosts != "" { + hosts := strings.Split(accessNodeBackupHosts, ",") + for i := range hosts { + hosts[i] = strings.TrimSpace(hosts[i]) + } + cfg.AccessNodeBackupHosts = hosts + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- bootstrap/bootstrap.go(4 hunks)
- cmd/run/cmd.go(3 hunks)
- config/config.go(1 hunks)
- services/ingestion/event_subscriber.go(1 hunks)
- tests/helpers.go(3 hunks)
- tests/integration_test.go(6 hunks)
- tests/key_store_release_test.go(1 hunks)
- tests/tx_batching_test.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/key_store_release_test.go
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-06T10:14:49.676Z
Learnt from: m-Peter
PR: onflow/flow-evm-gateway#890
File: services/ingestion/event_subscriber.go:235-239
Timestamp: 2025-10-06T10:14:49.676Z
Learning: In services/ingestion/event_subscriber.go, when reconnecting after disconnect errors (DeadlineExceeded, Internal, Unavailable), the subscription should reconnect at lastReceivedHeight rather than lastReceivedHeight+1. This avoids errors when the next height doesn't exist yet, and duplicate event processing is safe because the ingestion engine is explicitly designed to be idempotent (storage uses batch.Set() which overwrites existing entries).
Applied to files:
- services/ingestion/event_subscriber.go
📚 Learning: 2025-03-07T01:35:09.751Z
Learnt from: peterargue
PR: onflow/flow-evm-gateway#772
File: services/requester/keystore/key_store.go:50-62
Timestamp: 2025-03-07T01:35:09.751Z
Learning: In the flow-evm-gateway codebase, panics are acceptable in scenarios where immediate detection of critical bugs is desired during development and testing, particularly for invariant violations that should never occur in a correctly functioning system (e.g., when a key is available but locked in the keystore implementation).
Applied to files:
- services/ingestion/event_subscriber.go
🧬 Code graph analysis (2)
services/ingestion/event_subscriber.go (2)
models/events.go (1)
NewBlockEventsError(252-256)models/errors/errors.go (1)
ErrDisconnected(24-24)
tests/integration_test.go (3)
bootstrap/create-multi-key-account.go (1)
CreateMultiKeyAccount(70-194)config/config.go (3)
Config(43-127)
TxStateValidation(36-36)
LocalIndexValidation(39-39)bootstrap/bootstrap.go (2)
New(85-113)
Run(753-769)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Test
🔇 Additional comments (7)
tests/helpers.go (2)
70-93: LGTM! Clean refactoring for test configuration reusability.Extracting the server configuration into
defaultServerConfig()improves maintainability and allows tests to provide custom configurations when needed. Usingpanicfor error handling is acceptable in test helpers, though it differs from typical Go patterns.
95-119: LGTM! Signature change enables configurable test emulators.The expanded signature allows tests to provide custom server configurations, supporting scenarios like the new backup AccessNode test. The implementation correctly uses the provided configuration.
tests/integration_test.go (5)
33-33: LGTM! Consistent updates to match new startEmulator signature.All test functions correctly updated to pass
defaultServerConfig()tostartEmulator, maintaining consistency across the test suite.Also applies to: 144-144, 249-249, 328-328, 452-452
596-622: LGTM! Well-structured backup AccessNode test setup.The test correctly:
- Sets up primary and backup emulators with non-conflicting ports
- Maintains separate contexts for independent lifecycle management
- Addresses past review feedback on context handling
The backup emulator configuration with distinct GRPC/REST/Admin/Debugger ports ensures proper isolation during failover testing.
644-662: LGTM! Gateway configuration correctly includes backup hosts.The configuration properly wires in the
AccessNodeBackupHostsfield, demonstrating the new failover functionality introduced in this PR. The backup host correctly references the alternate emulator port.
664-690: LGTM! Proper gateway initialization and primary validation.The test correctly:
- Initializes the gateway with backup host configuration
- Ensures proper cleanup with deferred
boot.Stop()- Validates the primary AccessNode is serving requests before testing failover
692-711: LGTM! Robust failover validation with proper retry logic.The test correctly:
- Shuts down the primary AccessNode to trigger failover
- Uses
assert.Eventuallywith retry logic to account for gRPC resolver/load balancer detection time (addresses past review feedback)- Uses
backupCtxfor validation since the primary context was cancelled- Employs reasonable timeout (5s) and polling interval (500ms) for failover detection
This validates that the backup AccessNode successfully takes over when the primary becomes unavailable, fulfilling the PR's core objective.
422d66e    to
    ce4808f      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
services/ingestion/event_subscriber.go (1)
235-258: LGTM! Retry loop correctly implements 30-second reconnection window.The retry logic properly handles reconnection attempts with appropriate timeout and pause intervals. Reconnecting at
lastReceivedHeight(not +1) is correct because the ingestion engine is explicitly idempotent.Based on learnings.
Optional: Remove unused
attemptsvariable.The
attemptsvariable (line 236) is incremented but never used. If you plan to log or expose retry metrics, retain it; otherwise, consider removing it to reduce clutter.Apply this diff if the variable is not needed:
- attempts := 0 pauseDuration, maxDuration := 200*time.Millisecond, 30*time.Second // Allow reconnect retries for up to 30 seconds, with retry // attempts every 200 ms. for { err := connect(lastReceivedHeight) if err == nil { break } - attempts++ duration := time.Since(start)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- services/ingestion/event_subscriber.go(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-06T10:14:49.676Z
Learnt from: m-Peter
PR: onflow/flow-evm-gateway#890
File: services/ingestion/event_subscriber.go:235-239
Timestamp: 2025-10-06T10:14:49.676Z
Learning: In services/ingestion/event_subscriber.go, when reconnecting after disconnect errors (DeadlineExceeded, Internal, Unavailable), the subscription should reconnect at lastReceivedHeight rather than lastReceivedHeight+1. This avoids errors when the next height doesn't exist yet, and duplicate event processing is safe because the ingestion engine is explicitly designed to be idempotent (storage uses batch.Set() which overwrites existing entries).
Applied to files:
- services/ingestion/event_subscriber.go
📚 Learning: 2025-03-07T01:35:09.751Z
Learnt from: peterargue
PR: onflow/flow-evm-gateway#772
File: services/requester/keystore/key_store.go:50-62
Timestamp: 2025-03-07T01:35:09.751Z
Learning: In the flow-evm-gateway codebase, panics are acceptable in scenarios where immediate detection of critical bugs is desired during development and testing, particularly for invariant violations that should never occur in a correctly functioning system (e.g., when a key is available but locked in the keystore implementation).
Applied to files:
- services/ingestion/event_subscriber.go
🧬 Code graph analysis (1)
services/ingestion/event_subscriber.go (2)
models/events.go (1)
NewBlockEventsError(252-256)models/errors/errors.go (1)
ErrDisconnected(24-24)
🔇 Additional comments (1)
services/ingestion/event_subscriber.go (1)
227-227: LGTM! Enables failover to backup Access Nodes.Adding
codes.Unavailableto the reconnection trigger is correct. When the primary AN becomes unavailable, this ensures the retry loop engages and thepick_firstload balancer can select the next available backup host, preventing fatal errors.Based on learnings and past review discussions.
Closes: #764
Description
I tried this locally, by setting up 2 Flow Emulator processes, on different ports:
1st process:
2nd process:
Then I terminated the 1st process, which was configured to be the main AN (
AccessNodeHost).The EVM Gateway continued operating normally, by directing requests to
AccessNodeBackupHosts, the 2nd process.This was mainly achieved with these 2 gRPC options:
We make use of the
pick_firstbuilt-in client-side load balancer, and we still keep our customretryInterceptor, which will first retry erroneous requests to the same AN, before letting the load balancer pick the first available.For contributor use:
masterbranchFiles changedin the Github PR explorerSummary by CodeRabbit
New Features
Bug Fixes
Tests