## Problem
Agent discovery in `getRegisteredAgentsByEvents()` (and other functions in `erc8004.ts`) uses viem's default RPC endpoint for Base, which resolves to the public RPC at `mainnet.base.org`. This public RPC has reliability issues that block all agent discovery with no fallback:

Observed Feb 26: `mainnet.base.org` returned HTTP 503 ("no backend is currently healthy to serve traffic") for an extended period. `eth_blockNumber` queries succeeded, but `eth_getLogs` was dead. Every agent running on the Automaton framework lost the ability to discover other agents, and there was no way for operators to configure an alternative RPC endpoint.
## Impact

- Complete discovery outage when the Base public RPC has degraded service
- No operator workaround: the RPC URL is determined by viem's chain defaults inside `erc8004.ts`
- Affects all operators equally, since everyone depends on the same hardcoded default
- $10+ in wasted agent credits observed from a single outage (agents loop on failed discovery)
## Proposed Solution
Allow operators to configure a custom RPC URL for agent discovery. This would let operators use their own RPC provider (Alchemy, QuickNode, BlastAPI, etc.) instead of relying on the public endpoint.
Design considerations for the maintainers:
| Approach | Pros | Cons |
|---|---|---|
| Environment variable (`AUTOMATON_RPC_URL`) | Simple, zero schema changes, operators set and forget | May not match Conway's config patterns |
| `automaton.json` config field | Consistent with existing config pattern (if applicable) | Requires schema changes |
| Function parameter (`rpcUrl?: string`) | Explicit, no global state, caller controls | Changes exported function signatures |
| Fallback RPC list with automatic retry | Full resilience, automatic failover | Complex retry logic, over-engineered for initial implementation |
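To make the first row concrete: the environment-variable approach could be as small as one helper that returns the operator-supplied URL, or `undefined` so viem keeps using the chain default. This is a sketch under assumptions, not actual Automaton code; `resolveRpcUrl` is a hypothetical name, and the wiring comment assumes how `erc8004.ts` builds its clients.

```typescript
// Hypothetical helper: read the operator's RPC override from the environment.
// Returns undefined when unset or blank, so unconfigured deployments behave
// exactly as today (viem falls back to the chain's default endpoint).
export function resolveRpcUrl(
  env: Record<string, string | undefined> = process.env
): string | undefined {
  const url = env.AUTOMATON_RPC_URL?.trim();
  return url ? url : undefined;
}

// The three createPublicClient calls in erc8004.ts could then become (sketch):
//   createPublicClient({ chain: base, transport: http(resolveRpcUrl()) })
// viem's http() treats an omitted/undefined URL as "use the chain default".
```

The main appeal of this variant is that it needs no schema changes and is fully backward compatible: operators who set nothing see no behavior change.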
We intentionally kept this as an issue rather than a PR because the design decision depends on Conway's configuration patterns and architectural preferences. Happy to implement whichever approach the team prefers.
## Our Production Workaround (Reference)
We locally patched all three `createPublicClient` calls in `erc8004.ts` to use BlastAPI's free tier:

```ts
transport: http("https://base-mainnet.public.blastapi.io")
```

This immediately resolved the outage. BlastAPI has been reliable, but hardcoding a specific provider isn't the right upstream solution.
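If the maintainers ever want the fallback-list approach instead of a single hardcoded endpoint, viem's built-in `fallback` transport already handles the failover, so only the endpoint ordering needs to live in Automaton. The helper below is hypothetical (`buildRpcUrlList` is our name, not existing code), and the endpoint list is an assumption based on our workaround:

```typescript
// Hypothetical helper: order RPC endpoints so an operator-supplied URL is
// tried first, with public endpoints as fallbacks.
export function buildRpcUrlList(custom?: string): string[] {
  const defaults = [
    "https://base-mainnet.public.blastapi.io", // the workaround endpoint above
    "https://mainnet.base.org",                // viem's default for Base, last resort
  ];
  const url = custom?.trim();
  return url ? [url, ...defaults] : defaults;
}

// With viem's fallback transport, which tries transports in order and fails
// over on error, the clients in erc8004.ts could be built as (sketch):
//
//   import { createPublicClient, fallback, http } from "viem";
//   import { base } from "viem/chains";
//   createPublicClient({
//     chain: base,
//     transport: fallback(buildRpcUrlList(customUrl).map((url) => http(url))),
//   });
```

This keeps the "complex retry logic" out of Automaton itself, since `fallback` owns the failover behavior.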
## Related
- PR fix(discovery): paginate `eth_getLogs` to handle RPC block range limits #228: paginated block scanning (introduced the scanning infrastructure)
- PR fix(discovery): tune RPC timeout and failure threshold from production data #239: timeout tuning (adjusts `PER_CHUNK_TIMEOUT_MS` and `MAX_CONSECUTIVE_FAILURES`; addresses RPC latency, not outages)