
fix: move Redis subscriber connect to background task for faster startup#1538

Open

cristiam86 wants to merge 2 commits into main from fix/redis-subscriber-blocking-startup

Conversation


@cristiam86 (Contributor) commented Mar 17, 2026

Summary

  • The Redis subscriber's connect() + subscribe() was blocking the Uvicorn lifespan for ~2.5 minutes in production (rally-studio-prd), preventing health probes from being served
  • Uvicorn doesn't serve HTTP requests until the lifespan yields, so startup/liveness/readiness probes had no server to talk to during this window
  • Root cause: the blocking health checker (fixed in #1537, "fix: offload blocking I/O in health checks to prevent event loop freezes") was starving the async Redis connect, turning a sub-second operation into a multi-minute ordeal. Even with #1537 merged, the Redis connect still blocks the lifespan unnecessarily
  • Fix: Redis subscriber now connects in a background task with retry logic, so Uvicorn starts serving immediately (~16s instead of ~2.5min)
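
The core of the fix is simply not awaiting the connect inside the lifespan. A minimal sketch of the pattern — `slow_connect` stands in for the real `connect()` + `start()` calls and the 0.2s delay is illustrative, not the real subscriber API:

```python
import asyncio
import time

async def slow_connect() -> str:
    # Stand-in for the subscriber's connect() + start(); the sleep
    # mimics a slow broker handshake (timing is illustrative).
    await asyncio.sleep(0.2)
    return "connected"

async def lifespan_startup() -> asyncio.Task:
    # Schedule the connect without awaiting it, so the lifespan can
    # yield immediately and the server starts accepting requests.
    return asyncio.create_task(slow_connect())

async def main() -> None:
    start = time.monotonic()
    task = await lifespan_startup()
    elapsed = time.monotonic() - start
    # Startup returns long before the connect completes.
    print("startup returned fast:", elapsed < 0.1)
    print(await task)

asyncio.run(main())
```

In the real lifespan the task handle should be kept (and cancelled on shutdown) so the background connect isn't garbage-collected mid-flight.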

Evidence from rally-studio-prd logs

10:19:11 - [STARTUP] Application startup completed in 15.69 seconds
10:19:11 - RPC instance rpc-1 Redis subscriber initialized
10:21:38 - RPC instance rpc-1 connected to Redis        ← 2m27s gap!
10:21:38 - Application startup complete.                 ← Uvicorn FINALLY serves
10:21:38 - Uvicorn running on http://0.0.0.0:4000

Test plan

  • Backend unit tests pass (585 passed)
  • Deploy and verify Uvicorn starts serving within ~20s
  • Verify Redis subscriber connects in background and events flow normally
  • Verify pods stop crash-looping in rally-studio-prd

Summary by CodeRabbit

  • Refactor

    • Redis subscriber now connects in the background during startup, with improved retry handling, pre-registered event handlers, and clearer startup/health logging to reduce startup blocking and improve reliability.
  • Chores

    • Add database migration to create indexes on transaction status (and status+recipient) to speed related queries.

The Redis subscriber's connect() + subscribe() was blocking the Uvicorn
lifespan for ~2.5 minutes in production. Since Uvicorn doesn't serve
HTTP requests until the lifespan yields, health probes had no server to
talk to during this entire window.

The root cause: the background health checker (started before Redis
connect) runs blocking sync I/O that starves the async Redis connect,
turning a sub-second operation into a multi-minute ordeal.

Fix: run Redis subscriber connect in a background task so Uvicorn can
start serving health probes and RPC requests immediately. Event
handlers are registered before connect (stored locally) so no events
are missed once the connection completes.
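
The "handlers registered before connect" detail matters because events arriving right after the connection completes must already have somewhere to go. A toy sketch of the pattern — `EventSubscriber`, `on`, and `dispatch` are illustrative names, not the real `RedisEventSubscriber` API:

```python
import asyncio
from typing import Awaitable, Callable

Handler = Callable[[dict], Awaitable[None]]

class EventSubscriber:
    """Toy subscriber illustrating handler pre-registration."""

    def __init__(self) -> None:
        self._handlers: dict[str, Handler] = {}
        self.connected = False

    def on(self, event: str, handler: Handler) -> None:
        # Handlers live in a local dict, so they can be registered
        # before any network connection exists.
        self._handlers[event] = handler

    async def connect(self) -> None:
        await asyncio.sleep(0)  # stand-in for the real handshake
        self.connected = True

    async def dispatch(self, event: str, data: dict) -> None:
        # The first event after connect already finds its handler.
        if handler := self._handlers.get(event):
            await handler(data)

async def demo() -> list[str]:
    seen: list[str] = []

    async def handle_validator_change(event_data: dict) -> None:
        seen.append(event_data["event"])

    sub = EventSubscriber()
    sub.on("validator_change", handle_validator_change)  # before connect
    await sub.connect()
    await sub.dispatch("validator_change", {"event": "validator_change"})
    return seen

print(asyncio.run(demo()))
```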

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

The change defers Redis subscriber connection to a background task with handler registration occurring before connecting and a single retry on failure; also adds an Alembic migration creating two indexes on the transactions table.

Changes

Cohort / File(s) — Summary

  • Redis Subscriber Lifecycle Refactoring — backend/protocol_rpc/app_lifespan.py: Instantiate the subscriber and register validator-change handlers before connecting; add a _connect_redis_subscriber coroutine that attempts connect/start in the background with one retry after 5s; schedule it as a background task and adjust startup logging.
  • DB Migration — Transactions Indexes — backend/database_handler/migration/versions/e4f8a2b7c913_add_transactions_status_indexes.py: Add Alembic migration e4f8a2b7c913 that creates idx_transactions_status and idx_transactions_status_to_address (if not exists); provide a corresponding downgrade to drop them.

Sequence Diagram(s)

sequenceDiagram
    participant App as App Startup
    participant Handler as Handler Registration
    participant BGTask as Background Task
    participant Redis as Redis Subscriber
    participant Retry as Retry Logic

    App->>Handler: Register validator-change handlers
    Handler-->>App: Handlers registered (not yet connected)
    App->>BGTask: Schedule _connect_redis_subscriber
    App-->>App: Log "Redis subscriber connecting in background"
    BGTask->>Redis: Attempt initial connect/start
    alt Connection succeeds
        Redis-->>BGTask: Connected & listening
        BGTask-->>BGTask: Log success
    else Connection fails
        Redis-->>BGTask: Connection error
        BGTask->>Retry: Wait 5 seconds
        Retry->>Redis: Attempt second connect/start
        alt Retry succeeds
            Redis-->>BGTask: Connected & listening
            BGTask-->>BGTask: Log success
        else Retry fails
            Redis-->>BGTask: Connection error
            BGTask-->>BGTask: Log final error
        end
    end

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nudged the handlers in a cozy row,
Then hopped to the background where connections grow,
A five-second pause, a hopeful retry—
Indexes planted, ears up to the sky!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 60.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed — The title accurately describes the main change: moving the Redis subscriber connection to a background task to enable faster startup.
  • Description check ✅ Passed — The description covers the main issue (blocking startup), root cause, solution, and evidence from production logs; however, it is missing several template sections, including 'What', 'Why', 'Testing done', and 'Decisions made'.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/protocol_rpc/app_lifespan.py`:
- Around line 395-398: The function handle_validator_change uses an unnecessary
f-string in its logger call and lacks a type hint for the event_data parameter;
remove the leading "f" from the logger.info call so the message is a plain
string and add an appropriate type annotation for event_data (e.g., event_data:
dict or event_data: Any) on the handle_validator_change signature to improve
clarity and typing while leaving the await validators_manager.restart() call
unchanged.
- Around line 414-438: The retry path in _connect_redis_subscriber can leak a
partially-initialized RedisEventSubscriber because
RedisEventSubscriber.connect() assigns self.redis_client before later steps can
fail; before calling connect()/start() in the retry block, ensure any
previously-created client is cleaned up (e.g., if redis_subscriber.redis_client
is not None call the subscriber's close/disconnect/cleanup method or set it to
None after proper shutdown) or change the retry to call redis_subscriber.start()
only (which will internally call connect() only when redis_client is None) so
you don't re-create a second client; update the retry block around
redis_subscriber.connect()/redis_subscriber.start() accordingly to invoke the
proper cleanup or start-only path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bc11910a-94d0-4da9-a950-53f939b258f7

📥 Commits

Reviewing files that changed from the base of the PR and between 1e9764a and 824d4c4.

📒 Files selected for processing (1)
  • backend/protocol_rpc/app_lifespan.py

Comment on lines +395 to +398

    async def handle_validator_change(event_data):
        """Reload validators when they change."""
        logger.info(f"RPC worker reloading validators due to change event")
        await validators_manager.restart()

⚠️ Potential issue | 🟡 Minor

Remove extraneous f-prefix from string literal.

The f-string on line 397 has no placeholders. Additionally, consider adding a type hint for the event_data parameter.

Proposed fix
-    async def handle_validator_change(event_data):
+    async def handle_validator_change(event_data: dict):
         """Reload validators when they change."""
-        logger.info(f"RPC worker reloading validators due to change event")
+        logger.info("RPC worker reloading validators due to change event")
         await validators_manager.restart()
🧰 Tools
🪛 Ruff (0.15.6)

[error] 397-397: f-string without any placeholders

Remove extraneous f prefix

(F541)


Comment on lines +414 to +438

    async def _connect_redis_subscriber():
        """Connect Redis subscriber in background so Uvicorn starts serving immediately."""
        try:
            await redis_subscriber.connect()
            await redis_subscriber.start()
            logger.info(
                f"[STARTUP] Redis subscriber connected at {redis_url} for worker event broadcasting"
            )
        except Exception as e:
            logger.error(
                f"Failed to connect Redis subscriber at {redis_url}: {e}. "
                f"Worker events will not be received. Retrying in 5s..."
            )
            await asyncio.sleep(5)
            try:
                await redis_subscriber.connect()
                await redis_subscriber.start()
                logger.info(
                    f"[STARTUP] Redis subscriber connected at {redis_url} (retry succeeded)"
                )
            except Exception as retry_err:
                logger.error(
                    f"Redis subscriber retry failed: {retry_err}. "
                    f"Worker events will NOT be forwarded to WebSocket clients."
                )

⚠️ Potential issue | 🟡 Minor

Potential connection leak if retry is needed after partial connect failure.

If connect() partially succeeds (e.g., redis_client is assigned but ping() or subscribe() fails), the retry block calls connect() again without closing the existing redis_client. Looking at RedisEventSubscriber.connect(), self.redis_client is assigned before the connection test, so a failure after assignment would leak the first connection object.

Consider adding explicit cleanup before retry, or make the retry only call start() since it internally calls connect() if redis_client is None.

Proposed fix: cleanup before retry
         except Exception as e:
             logger.error(
                 f"Failed to connect Redis subscriber at {redis_url}: {e}. "
                 f"Worker events will not be received. Retrying in 5s..."
             )
+            # Cleanup any partial connection before retry
+            try:
+                await redis_subscriber.stop()
+            except Exception:
+                pass
             await asyncio.sleep(5)
             try:
                 await redis_subscriber.connect()
                 await redis_subscriber.start()
🧰 Tools
🪛 Ruff (0.15.6)

[warning] 422-422: Do not catch blind exception: Exception

(BLE001)


[warning] 434-434: Do not catch blind exception: Exception

(BLE001)


The transactions table had no index on status, causing every health
check query to do a full sequential scan. In production this resulted
in 61M seq scans and 5.1T tuple reads, driving the DB to 100% CPU.

Adds:
- idx_transactions_status (status)
- idx_transactions_status_to_address (status, to_address)
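
The two indexes described above can be demonstrated with a small idempotency check — SQLite stands in for PostgreSQL here purely to show the IF NOT EXISTS behavior that the migration's if_not_exists=True mirrors; the table shape is a guess at the relevant columns:

```python
import sqlite3

# In-memory stand-in for the transactions table (columns are assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, status TEXT, to_address TEXT)"
)

# The two indexes the migration creates, using IF NOT EXISTS so that
# re-running the statement is a no-op rather than an error.
conn.execute("CREATE INDEX IF NOT EXISTS idx_transactions_status ON transactions (status)")
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_transactions_status_to_address "
    "ON transactions (status, to_address)"
)
# Re-running does not fail, mirroring the migration's if_not_exists=True.
conn.execute("CREATE INDEX IF NOT EXISTS idx_transactions_status ON transactions (status)")

names = {
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'transactions'"
    )
}
print(sorted(names))
```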
@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
backend/database_handler/migration/versions/e4f8a2b7c913_add_transactions_status_indexes.py (2)

36-38: Add if_exists=True for downgrade resilience.

The downgrade could fail if indexes don't exist (e.g., partial upgrade or manual removal). For consistency with the idempotent upgrade, consider adding if_exists=True.

♻️ Proposed fix
 def downgrade() -> None:
-    op.drop_index("idx_transactions_status_to_address", table_name="transactions")
-    op.drop_index("idx_transactions_status", table_name="transactions")
+    op.drop_index("idx_transactions_status_to_address", table_name="transactions", if_exists=True)
+    op.drop_index("idx_transactions_status", table_name="transactions", if_exists=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@backend/database_handler/migration/versions/e4f8a2b7c913_add_transactions_status_indexes.py`
around lines 36 - 38, In the downgrade function, make dropping indexes resilient
by passing if_exists=True to the op.drop_index calls; update the two calls to
op.drop_index that target "idx_transactions_status_to_address" and
"idx_transactions_status" (with table_name="transactions") so they include
if_exists=True to avoid failing when the indexes are already absent.

21-33: Consider concurrent index creation for large tables.

Creating indexes without CONCURRENTLY can lock writes on the table during migration. If transactions is large or frequently written in production, use postgresql_concurrently=True with autocommit_block() to avoid blocking writes:

from alembic import op

def upgrade() -> None:
    with op.get_context().autocommit_block():
        op.create_index(
            "idx_transactions_status",
            "transactions",
            ["status"],
            if_not_exists=True,
            postgresql_concurrently=True,
        )
    with op.get_context().autocommit_block():
        op.create_index(
            "idx_transactions_status_to_address",
            "transactions",
            ["status", "to_address"],
            if_not_exists=True,
            postgresql_concurrently=True,
        )

def downgrade() -> None:
    with op.get_context().autocommit_block():
        op.drop_index("idx_transactions_status_to_address", table_name="transactions", postgresql_concurrently=True)
    with op.get_context().autocommit_block():
        op.drop_index("idx_transactions_status", table_name="transactions", postgresql_concurrently=True)

This is optional depending on your deployment strategy and table size, but recommended for production tables with heavy write traffic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@backend/database_handler/migration/versions/e4f8a2b7c913_add_transactions_status_indexes.py`
around lines 21 - 33, The upgrade currently creates indexes on transactions
without using CONCURRENTLY which can block writes; modify upgrade() to create
each index inside op.get_context().autocommit_block() and pass
postgresql_concurrently=True to op.create_index (for "idx_transactions_status"
and "idx_transactions_status_to_address"), and mirror this in downgrade() by
dropping indexes with op.drop_index inside autocommit_block() using
postgresql_concurrently=True and table_name="transactions" to safely remove them
without locking writes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3e79ea66-975a-4169-95b1-b0e0a84524cd

📥 Commits

Reviewing files that changed from the base of the PR and between 824d4c4 and c825a76.

📒 Files selected for processing (1)
  • backend/database_handler/migration/versions/e4f8a2b7c913_add_transactions_status_indexes.py
