feat: add extractor supervisor with fault tolerance and exponential backoff by zizou0x · Pull Request #1026 · propeller-heads/tycho

zizou0x · 2026-05-19T09:40:21Z

What

Adds ExtractorSupervisor and ExtractorFactory to handle extractor restarts
with exponential backoff, replacing the one-shot ExtractorBuilder pattern.

Why

Previously, any extractor failure caused the whole indexer process to panic.
This PR removes that coupling: each extractor is now isolated and restarts
independently on failure without affecting other extractors or the RPC server.

Changes

New ExtractorSupervisor (extractor/supervisor.rs)

Runs a loop: build runner → run → on failure, clear WS subscriptions, signal
PendingDeltas to reset its buffer, wait with exponential backoff, rebuild.
Exposes handle() so callers get an ExtractorHandle without managing the
control channel directly.
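
For readers skimming the description, the loop above can be sketched roughly as follows. This is a synchronous, std-only illustration and not the code in supervisor.rs: the `Lifecycle` trait, `RunOutcome`, and `supervise` are hypothetical names, and the real supervisor is async and talks to tokio channels.

```rust
use std::time::Duration;

/// Outcome of one extractor run (hypothetical; the PR has its own error type).
#[derive(Debug, PartialEq)]
pub enum RunOutcome {
    /// The runner was asked to stop; the supervisor should exit.
    Stopped,
    /// The runner failed; the supervisor should consider a restart.
    Failed(String),
}

/// Hooks the supervisor calls between runs. All names are illustrative.
pub trait Lifecycle {
    fn build_and_run(&mut self) -> RunOutcome;
    fn clear_subscriptions(&mut self);
    fn signal_reset(&mut self);
    fn wait(&mut self, delay: Duration);
}

/// Runs until a clean stop or until `max_restarts` is exhausted
/// (`None` means restart forever). Returns the number of restarts performed.
pub fn supervise<L: Lifecycle>(life: &mut L, max_restarts: Option<u32>, base: Duration) -> u32 {
    let mut restarts = 0u32;
    loop {
        match life.build_and_run() {
            RunOutcome::Stopped => return restarts,
            RunOutcome::Failed(_) => {
                if max_restarts.map_or(false, |max| restarts >= max) {
                    return restarts;
                }
                life.clear_subscriptions();
                life.signal_reset();
                // Double the delay on each restart; the exponent is clamped at 7.
                let delay = base * 2u32.pow(restarts.min(7));
                life.wait(delay);
                restarts += 1;
            }
        }
    }
}
```

Keeping the side effects behind a trait is only a convenience for this sketch: it lets the restart policy be exercised without real streams or sleeps.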

New ExtractorFactory (same file)

Async constructor: one RPC call for ChainState, one populate() for
ProtocolMemoryCache. Both are reused across restarts (cache via cheap Arc
clone; chain state is Copy).
build_runner() creates a fresh ProtocolExtractor, DCI plugin, and
Substreams stream per restart, with a fresh CachedGateway instance for
write isolation.

ExtractorRunner simplified

No longer handles control messages. Receives a oneshot stop signal and
shared ws_subscriptions/pending_deltas_tx directly.
WS subscription management and stop handling moved to the supervisor.

Config types moved

ExtractorConfig, ProtocolTypeConfig, DCIType moved from runner.rs
to supervisor.rs. Internal fields are now private; only
initialized_accounts, initialized_accounts_block, and dci_plugin
remain pub.
max_restarts: Option<u32> added to ExtractorConfig (YAML-configurable,
default None = restart forever).

PendingDeltas updated

run() now accepts pre-built receivers and a reset channel instead of
subscribing through extractor handles dynamically. On restart the supervisor
sends the extractor name on the reset channel to clear its buffer.

Supporting changes

CachedGateway::new_instance() — fresh gateway with independent LRU cache
and open_tx, sharing the same write channel and connection pool.
Chain::block_time() — per-chain block time in seconds.
ExtractionError::variant_name() — static label for Prometheus counters.
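
A static label per error variant could look something like the sketch below. The variants shown are made up for illustration and are not the real `ExtractionError` variants; only the `variant_name()` name comes from this PR.

```rust
/// Illustrative error type; the variant set here is hypothetical.
#[derive(Debug)]
pub enum ExtractionError {
    Setup(String),
    Decode(String),
    Storage(String),
}

impl ExtractionError {
    /// Returns a fixed, low-cardinality label suitable for a metrics counter.
    /// The payload is deliberately ignored so label values stay bounded.
    pub fn variant_name(&self) -> &'static str {
        match self {
            ExtractionError::Setup(_) => "setup",
            ExtractionError::Decode(_) => "decode",
            ExtractionError::Storage(_) => "storage",
        }
    }
}
```

Returning `&'static str` rather than formatting the error avoids putting free-form messages into metric labels.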


kayibal

🤖 The three-struct split (ExtractorFactory / ExtractorSupervisor / ExtractorRunner) is cleaner than the old builder, but it's worth asking whether ExtractorRunner pulls its weight now. Before this PR, Runner owned both the stream loop and control message handling. After the refactor, control messages moved to the supervisor, and Runner's only remaining job is driving the select! loop over the stream. All of its fields (ws_subscriptions, pending_deltas_tx, stop_rx) are owned by the supervisor and injected per restart — Runner holds no state of its own.

Not necessarily a blocker, but worth a conversation: is the stream loop complex enough to justify a dedicated struct, or would a private run_stream() method on the supervisor give the same isolation with one fewer abstraction layer?

…ackoff

Introduces ExtractorSupervisor, which wraps ExtractorFactory and manages the full lifecycle of an extractor: building, running, restarting on failure with exponential backoff, clearing WS subscriptions, and signalling PendingDeltas to reset its buffer between runs.

Key design changes:

- Replace ExtractorBuilder (one-shot builder pattern) with ExtractorFactory, designed for repeated use across restarts. The factory is async at construction: one RPC call for ChainState, one populate() for ProtocolMemoryCache, both reused across all restarts via clone.
- ExtractorRunner is now decoupled from control-message handling. It receives a oneshot stop signal and shared ws_subscriptions/pending_deltas_tx directly. Subscribe and stop are handled by the supervisor.
- ExtractorSupervisor creates its own control channel internally and exposes handle() to give callers an ExtractorHandle. max_restarts is moved to ExtractorConfig as Option<u32> (None = restart forever, default).
- ExtractorConfig, ProtocolTypeConfig, and DCIType move to supervisor.rs alongside ExtractorFactory. Internal fields are private; only initialized_accounts, initialized_accounts_block, and dci_plugin remain pub.
- PendingDeltas.run() now accepts pre-built receivers and a reset channel rather than subscribing dynamically through extractors. On restart the supervisor sends the extractor name on the reset channel so PendingDeltas can clear its buffer.
- CachedGateway gains new_instance() for creating fresh per-restart gateway instances with independent LRU cache and open_tx.
- Chain::block_time() added to tycho-common for per-chain block time estimation.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Replace two unwrap() calls with ? so that S3 body read failures and file write failures surface as errors rather than panics in production. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

new() conventionally signals a cheap, synchronous, infallible constructor. This one makes an RPC call and populates a DB-backed cache, so create() better reflects what the caller should expect. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…extractor channel Introduce DeltaCommand { Block(ExtractorMsg), ExtractorRestarted(String) } sent over the same channel as block messages. PendingDeltas now handles buffer resets in-band, which guarantees ExtractorRestarted is always processed after the last block the runner emitted before stopping. Removes the separate reset_tx: Sender<String> channel and its consumers in ServicesBuilder and PendingDeltas::run. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
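
The ordering guarantee described in this commit follows from using a single FIFO channel. A minimal sketch, assuming a simplified `ExtractorMsg` and a hypothetical `PendingBuffer` in place of the real PendingDeltas, and using `std::sync::mpsc` rather than the tokio channel the PR uses:

```rust
use std::collections::HashMap;
use std::sync::mpsc;

/// Stand-in for the PR's ExtractorMsg; these fields are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub struct ExtractorMsg {
    pub extractor: String,
    pub block: u64,
}

/// Same shape as the enum named in the commit message.
#[derive(Debug, Clone, PartialEq)]
pub enum DeltaCommand {
    Block(ExtractorMsg),
    ExtractorRestarted(String),
}

/// Illustrative buffer keyed by extractor name.
#[derive(Default)]
pub struct PendingBuffer {
    blocks: HashMap<String, Vec<u64>>,
}

impl PendingBuffer {
    /// Applies one command. A restart drops everything buffered for that extractor.
    pub fn apply(&mut self, cmd: DeltaCommand) {
        match cmd {
            DeltaCommand::Block(msg) => self.blocks.entry(msg.extractor).or_default().push(msg.block),
            DeltaCommand::ExtractorRestarted(name) => {
                self.blocks.remove(&name);
            }
        }
    }

    /// Drains whatever is currently queued, in the order it was sent.
    pub fn drain_channel(&mut self, rx: &mpsc::Receiver<DeltaCommand>) {
        while let Ok(cmd) = rx.try_recv() {
            self.apply(cmd);
        }
    }

    /// Blocks currently buffered for `extractor` (empty if none).
    pub fn pending(&self, extractor: &str) -> &[u64] {
        self.blocks.get(extractor).map(Vec::as_slice).unwrap_or(&[])
    }
}
```

Because the reset travels behind the blocks in the same queue, the consumer cannot observe it before the last block sent ahead of it, which a separate reset channel could not guarantee.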

…ry.rs supervisor.rs pulled in aws_sdk_s3, prost, all DCI types, and the full Substreams construction stack purely because the factory lived there. Moving ExtractorFactory, ExtractorConfig, ProtocolTypeConfig, DCIType, ensure_spkg, and download_file_from_s3 into factory.rs keeps each file focused on one thing. supervisor.rs is now ~200 lines covering only restart lifecycle. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…loses When the supervisor drops WS subscriptions on extractor restart, the stream in WsActor silently ended and the default StreamHandler::finished() called ctx.stop(), closing the entire WebSocket connection for all subscriptions — including those to unaffected extractors. Add a private ExtractorEvent enum (Message | ChannelClosed) yielded by each subscription stream. The ChannelClosed variant triggers SubscriptionEnded to the client and cleans up that subscription's state. finished() closes the connection only if no subscriptions remain, preserving the connection when other extractors are still active. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
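
The per-subscription cleanup described in this commit can be modelled without the actor framework. The sketch below is an assumption-laden simplification: `Connection`, `Outgoing`, and the `u64` subscription ids are hypothetical, and only the `ExtractorEvent` variants and the `SubscriptionEnded` name come from the commit message.

```rust
use std::collections::HashSet;

/// Event yielded by one subscription stream (variant names from the commit message).
#[derive(Debug, PartialEq)]
pub enum ExtractorEvent {
    Message(String),
    ChannelClosed,
}

/// What gets sent to the client; a hypothetical simplification.
#[derive(Debug, PartialEq)]
pub enum Outgoing {
    Delta { subscription: u64, payload: String },
    SubscriptionEnded { subscription: u64 },
}

/// Tracks the subscriptions still alive on one WebSocket connection.
#[derive(Default)]
pub struct Connection {
    active: HashSet<u64>,
}

impl Connection {
    pub fn subscribe(&mut self, id: u64) {
        self.active.insert(id);
    }

    /// Handles one event; a closed channel ends only that subscription.
    pub fn on_event(&mut self, id: u64, event: ExtractorEvent) -> Option<Outgoing> {
        if !self.active.contains(&id) {
            return None;
        }
        match event {
            ExtractorEvent::Message(payload) => Some(Outgoing::Delta { subscription: id, payload }),
            ExtractorEvent::ChannelClosed => {
                self.active.remove(&id);
                Some(Outgoing::SubscriptionEnded { subscription: id })
            }
        }
    }

    /// Mirrors the finished() rule: close only when nothing is left.
    pub fn should_close(&self) -> bool {
        self.active.is_empty()
    }
}
```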

kayibal · 2026-05-29T08:54:13Z

+            // Exponential backoff: 120s, 240s, 480s, 960s, 1920s, 3840s, 7680s, 15360s
+            // (capped at 4 hours 16 minutes).
+            let exp = restart_count.min(7); // 120 * 2^7 = 15360s = 4h16m; clamp the exponent here to avoid overflow.
+            let backoff = std::time::Duration::from_secs(120 * 2u64.pow(exp));


🤖 Backoff starts at 120s — first retry is 2 minutes

The first restart delay is 120 * 2^0 = 120s regardless of failure cause. A momentary network blip or DB connection drop will keep the extractor dark for 2 minutes before the first retry.

Typical supervisory backoffs start at 1–10s (with optional jitter) and cap at minutes or hours; they do not start there. Consider 5s, 10s, 20s, 40s, ..., capped at 3600s. The 4-hour cap is reasonable; it's the floor that's too high.

I agree we should immediately restart on the first error. So I'd change this to be 1s or so initially.

We already depend on tokio-retry in this crate, so we could reuse it instead of implementing our own exponential backoff:

https://docs.rs/tokio-retry/latest/tokio_retry/
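
To make the suggested schedule concrete, here is a small std-only sketch of a capped exponential backoff with an explicit floor and cap. It is not the tokio-retry API and not the code in this PR; `backoff_delay` and its parameters are hypothetical names.

```rust
use std::time::Duration;

/// Delay before restart number `restart_count` (0-based): `base * 2^restart_count`,
/// never more than `cap`. Overflow in either step falls back to the cap.
pub fn backoff_delay(restart_count: u32, base: Duration, cap: Duration) -> Duration {
    // 2^restart_count, saturating at u32::MAX once the exponent no longer fits.
    let factor = 2u32.checked_pow(restart_count).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(cap).min(cap)
}
```

With a 1s base and a 3600s cap this yields 1s, 2s, 4s, 8s, and so on until the cap takes over. Applying the cap to the computed duration, rather than to the exponent, keeps the maximum independent of the base: with a 120s base, clamping the exponent at 7 gives 120 * 128 = 15360s, slightly over four hours.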

kayibal · 2026-05-29T08:54:14Z

+            }
+        }
+
        // TODO: rename variable here instead


🤖 Stale TODO + wrong variable name — clean up now

arced_message is not an Arc — it's just the message moved into a new binding. This rename has been deferred for a while and the surrounding function was significantly refactored in this PR, making it a natural moment to finish the job.

It's actually an Arc here but indeed we should probably just do the renaming now.

kayibal · 2026-05-29T08:54:14Z

+        .await?;
+
+        // Create dedicated PendingDeltas channel for this extractor
+        let (pd_tx, pd_rx) = tokio::sync::mpsc::channel(256);


🤖 Magic number for PendingDeltas channel buffer

256 is a tunable that affects backpressure and memory usage under load. Embedding it as a literal makes it easy to overlook and inconsistent with the 128 used for the supervisor control channel. A named constant makes the intent visible and the value easy to adjust.
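
One possible shape for this, sketched with `std::sync::mpsc::sync_channel` as a stand-in for the tokio channel in the diff. The constant and function names are hypothetical; only the values 256 and 128 come from this thread.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Buffer size of the per-extractor PendingDeltas channel.
pub const PENDING_DELTAS_CHANNEL_CAPACITY: usize = 256;

/// Buffer size of the supervisor control channel.
pub const CONTROL_CHANNEL_CAPACITY: usize = 128;

/// Creates the bounded PendingDeltas channel using the named capacity.
pub fn pending_deltas_channel<T>() -> (SyncSender<T>, Receiver<T>) {
    sync_channel(PENDING_DELTAS_CHANNEL_CAPACITY)
}
```

Placing both capacities side by side also makes it easier to notice when they drift apart without a reason.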

kayibal

I think removing websocket knowledge from the supervisor should still happen before we merge this.

kayibal · 2026-05-29T09:15:05Z

+        ws_subscriptions: Arc<Mutex<SubscriptionsMap>>,
+        pending_deltas_tx: Option<Sender<DeltaCommand>>,


Wait, so PendingDeltas used to operate with the standard subscription mechanism, and now we use something separate?

kayibal · 2026-05-29T09:16:49Z

+    stop_rx: oneshot::Receiver<()>,
    /// Handle of the tokio runtime on which the extraction tasks will be run.
-    /// If 'None' the default runtime will be used.
+    /// If `None` the default runtime will be used.


Would be good to mention here in the comment that this refers to the tokio runtime.

kayibal · 2026-05-29T09:25:25Z

+    ws_subscriptions: Arc<Mutex<SubscriptionsMap>>,
+    /// Dedicated channel for PendingDeltasBuffer — survives restarts.
+    pending_deltas_tx: Option<Sender<DeltaCommand>>,
+    /// Oneshot stop signal from the supervisor.


I don't like that these are split now. I also dislike that, to correctly model the old behaviour, pending_deltas_tx has to be optional. Ideally, DeltaCommand should be moved into SubscriptionsMap. Then any other component that needs it, whether internal or external, can subscribe to it via subscriptions.

kayibal · 2026-05-29T09:28:32Z

+            // Clear WS subscriptions — clients must reconnect after a restart.
+            // TODO: can we keep the ws connections alive and handle this on the client side?
+            {
+                let mut subs = self.ws_subscriptions.lock().await;


I really dislike that the Supervisor and the ExtractorRunner now "know" about websocket subscriptions and treat them differently from PendingDeltas subscriptions. Previously the only concept was subscribers. In my opinion that can be kept now that we have introduced the DeltaCommand type.

At the moment you are forcefully clearing the websocket subscriptions, which makes the normal subscription service potentially unusable for any future internal use case. I think the connection restart should be decided at the websocket layer, not here. So services::ws should instead receive a DeltaCommand::ExtractorRestarted and act on it as it sees fit.
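
A rough sketch of the single-subscriber-concept idea, to make the proposal easier to discuss. Everything here is a hypothetical simplification: `Subscriptions` stands in for SubscriptionsMap, `DeltaCommand::Block` carries a bare block number, and std channels replace the tokio ones.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Simplified command type; the real Block variant wraps an ExtractorMsg.
#[derive(Debug, Clone, PartialEq)]
pub enum DeltaCommand {
    Block(u64),
    ExtractorRestarted(String),
}

/// One list of subscribers; the publisher does not know who they are.
#[derive(Default)]
pub struct Subscriptions {
    senders: Vec<Sender<DeltaCommand>>,
}

impl Subscriptions {
    /// Registers a new subscriber (WS layer, PendingDeltas, or anything else).
    pub fn subscribe(&mut self) -> Receiver<DeltaCommand> {
        let (tx, rx) = channel();
        self.senders.push(tx);
        rx
    }

    /// Sends the command to every subscriber and forgets those that hung up.
    pub fn publish(&mut self, cmd: DeltaCommand) {
        self.senders.retain(|tx| tx.send(cmd.clone()).is_ok());
    }

    pub fn subscriber_count(&self) -> usize {
        self.senders.len()
    }
}
```

In this shape the supervisor only publishes `ExtractorRestarted`; whether a subscriber drops its client connection, clears a buffer, or ignores the event is decided on the receiving side.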

github-project-automation Bot added this to Tycho May 19, 2026

github-project-automation Bot moved this to Todo in Tycho May 19, 2026

claude Bot reviewed May 19, 2026

kayibal reviewed May 19, 2026 (four comment threads on crates/tycho-indexer/src/extractor/supervisor.rs, all now outdated)

zizou0x and others added 6 commits May 21, 2026 09:54

fix: propagate errors in download_file_from_s3 instead of panicking

8a1e715


zizou0x force-pushed the zz/extractor-fault-tolerance branch from afc0fb0 to a076ecf Compare May 21, 2026 07:57

refactor: fmt and clippy suggestions

eac2994

kayibal reviewed May 29, 2026
