feat: add Temporal durable execution layer #32
Bajuzjefe wants to merge 2 commits into masumi-network:main
Conversation
Research and plan for 4 scaling contributions to Kodosumi:

- PR 1: PostgreSQL support (alongside SQLite)
- PR 2: Redis Streams event transport (alongside Ray queue polling)
- PR 3: Temporal durable execution (alongside direct execution)
- PR 4: Docker Compose development environment

All features are opt-in via config with zero impact on default behavior. Addresses upstream issues masumi-network#8 (scale spooler) and masumi-network#11 (provide containers).
Add opt-in Temporal workflow wrapping via `KODO_EXECUTION_MODE=temporal`. When unset (default "direct"), existing direct Ray execution is unchanged.

- Add AgentWorkflow with pause/resume/cancel signals and status queries
- Add execute_agent activity wrapping create_runner() with heartbeats
- Add temporal_worker module and `koco temporal-worker` CLI command
- Update Launch() to branch to the Temporal path when mode is "temporal"
- Add temporalio as an optional dependency group
- 16 unit tests covering dataclasses, workflow structure, config, and CLI

Addresses issue masumi-network#8 (scale spooler) — durability layer.
```python
            error="Activity cancelled by Temporal")
    except Exception as e:
        return AgentJobResult(
            fid=fid, status="failed", error=str(e))
```
Activity swallows exceptions, preventing Temporal retry policy
High Severity
The execute_agent activity catches all exceptions (both Exception and asyncio.CancelledError) and returns an AgentJobResult instead of letting them propagate. Temporal's retry policy only triggers when an activity raises an exception — a successful return value, even one containing status="failed", is treated as a completed activity. This means the RetryPolicy(maximum_attempts=3) configured in AgentWorkflow.run will never activate for code-level failures (e.g., create_runner failing, Ray errors). Only process-level crashes detected via heartbeat timeout would trigger retries, defeating a core stated goal of the integration.
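The fix is to let failures propagate instead of packaging them into a return value. Below is a minimal, runnable sketch of that pattern; the `@activity.defn` decorator and the real `create_runner()` call are omitted so the snippet has no Temporal dependency, and `run_job` stands in for the actual runner invocation. `AgentJobResult` is stubbed here with only the fields shown in the diff.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentJobResult:
    # Minimal stand-in for the PR's result dataclass.
    fid: str
    status: str
    error: Optional[str] = None


async def execute_agent(fid: str, run_job) -> AgentJobResult:
    # In the real activity this carries @activity.defn and calls
    # create_runner(); run_job stands in for that call here.
    try:
        await run_job()
    except asyncio.CancelledError:
        # Cancellation must propagate so Temporal records the activity
        # as cancelled; do any cleanup first, then re-raise.
        raise
    except Exception:
        # Do NOT swallow failures into a status="failed" return value:
        # a returned result counts as activity *success* and the
        # RetryPolicy never fires. Re-raise so Temporal retries.
        raise
    return AgentJobResult(fid=fid, status="completed")
```

A returned `AgentJobResult` should be reserved for genuine completion; everything else raises, which is what `RetryPolicy(maximum_attempts=3)` keys off.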
```python
@workflow.signal
async def resume(self):
    self._paused = False
    self._status = "running"
```
Pause/resume signals have no effect on execution
High Severity
The pause and resume signal handlers only mutate _paused and _status state variables, but run() never checks _paused — it immediately starts execute_activity and awaits completion. Without a workflow.wait_condition(lambda: not self._paused) call, sending a pause signal changes queryable state but does not actually pause activity execution. Similarly, cancel_workflow is only checked once before the activity starts; once execute_activity is running, the signal has no effect.
```python
jwt=request.cookies.get(TOKEN_KEY) or request.headers.get(
    HEADER_KEY),
inputs=inputs_dict,
extra=extra,
```
Unsanitized extra dict breaks Temporal JSON serialization
High Severity
_launch_temporal carefully converts entry_point (callable→string) and inputs (BaseModel→dict) for Temporal JSON serialization, but passes extra through unmodified. For endpoints defined with app.enter(), _method_lookup stores a Model instance (a non-serializable custom class) in extra['model']. Temporal's DataConverter will fail to JSON-serialize the AgentJobInput dataclass, causing a runtime error. Since enter() is the primary way to define interactive agent endpoints, this breaks the Temporal path for the main Kodosumi use case.
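One way to address this is a sanitizing pass over `extra` before building `AgentJobInput`, mirroring the conversions already done for `entry_point` and `inputs`. The sketch below is a generic version: the function name and fallback strategy are suggestions, and the `model_dump` branch assumes the stored model is pydantic-like (the actual Kodosumi `Model` class may need its own conversion).

```python
import json
from typing import Any, Dict


def sanitize_extra(extra: Dict[str, Any]) -> Dict[str, Any]:
    """Make `extra` JSON-safe before handing it to Temporal's
    DataConverter. Names and fallbacks here are illustrative."""
    clean: Dict[str, Any] = {}
    for key, value in extra.items():
        if hasattr(value, "model_dump"):           # pydantic-style model
            clean[key] = value.model_dump()
        elif callable(value):                      # e.g. entry-point callables
            clean[key] = f"{value.__module__}:{value.__qualname__}"
        else:
            try:
                json.dumps(value)                  # already serializable?
                clean[key] = value
            except TypeError:
                clean[key] = repr(value)           # lossy last-resort fallback
    return clean
```

The activity side would then need to rehydrate whatever it actually uses from `extra` (e.g. re-import the model by dotted path), since the round trip is intentionally lossy.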
```python
jwt=job_input.jwt,
panel_url=job_input.panel_url,
fid=job_input.fid,
)
```
Missing actor cleanup makes retry always fail
High Severity
When a Temporal worker crashes, the detached Ray Runner actor (created with lifetime="detached" and name=fid) survives in the Ray cluster. On retry, create_runner() is called with the same fid via fid=job_input.fid, which attempts to create a new detached actor with an identical name — this raises a ValueError from Ray because the actor already exists. There is no cleanup of the previous actor before creation. The existing kill_runner() helper could handle this, but it's never called. This makes crash recovery — the stated primary goal of the Temporal integration — non-functional whenever the Ray cluster survives the worker failure.
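A kill-before-create step would make retries idempotent. The sketch below injects the lookup/kill callables so it runs without a Ray cluster; in real code they would be `ray.get_actor` and `ray.kill` (or the PR's existing `kill_runner()` helper), and `ray.get_actor` does raise `ValueError` when no actor with that name exists.

```python
from typing import Any, Callable


def ensure_fresh_runner(
    fid: str,
    create_runner: Callable[[str], Any],
    get_actor: Callable[[str], Any],
    kill: Callable[[Any], None],
) -> Any:
    """Remove a leftover detached Runner actor before recreating it.
    get_actor/kill would be ray.get_actor / ray.kill in real code;
    they are injected here so the sketch runs without Ray."""
    try:
        # A detached actor named `fid` survives a worker crash.
        stale = get_actor(fid)
    except ValueError:
        # ray.get_actor raises ValueError when the actor is absent.
        stale = None
    if stale is not None:
        kill(stale)                 # clear the name before re-creating
    return create_runner(fid)
```

With this guard in place, the Temporal retry path no longer trips over the surviving actor from the previous attempt.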
DevRel Review — Temporal Durable Execution Layer
The approach here is solid. Temporal is a natural fit for long-running agent jobs that need crash recovery — Ray's own fault tolerance is actor-level, not workflow-level, so this fills a real gap for production deployments.
What looks good:
- Opt-in via `KODO_EXECUTION_MODE=temporal` with zero behavioural change when unset — safe to merge alongside existing deployments
- Signal/query surface (`pause`, `resume`, `cancel`, `get_status`, `is_paused`, `get_error`) matches what operators need for observability
- 16 unit tests covering dataclasses, config, workflow structure, and CLI is a reasonable baseline
- `koco temporal-worker` CLI aligns with the existing `koco` CLI pattern
Questions for maintainers to consider before merge:

- **Retry configuration** — `max_attempts=3` and `heartbeat_timeout=120s` are currently hardcoded. For production operators running long inference jobs (>2 min), a 120s heartbeat timeout may be too tight. Are these intended to be env-var configurable, or is that a follow-up PR?
- **Ray interaction** — When `EXECUTION_MODE=temporal`, the Temporal worker still calls `create_runner()`, which presumably uses Ray under the hood. Is there any risk of double-scheduling if both Ray's actor supervision and Temporal's retry logic try to recover the same execution? The `SCALING.md` doesn't address this interaction.
- **Temporal server dependency** — The test plan assumes a local Temporal server (`tctl` / `temporal server start-dev`). The Docker Compose PR (#33) should probably include Temporal server as a service to make the dev loop self-contained — worth coordinating with @Bajuzjefe.
- **Worker process lifecycle** — If `koco temporal-worker` exits, does Temporal queue jobs for when it reconnects, or do in-flight executions need manual retriggering? Worth documenting the recovery guarantees clearly in `SCALING.md`.
Doc note: If this is merged, the Kodosumi installation guide at docs.kodosumi.io will need a new section covering Temporal worker setup, env vars, and the `koco temporal-worker` command. Happy to help draft that once the PR lands.
Overall this is a well-structured contribution addressing a real production gap. The test coverage and opt-in design are exactly right.


Summary

- `KODO_EXECUTION_MODE=temporal` opt-in; default `EXECUTION_MODE=direct` — zero change for existing deployments
- `AgentWorkflow` wraps Runner with automatic retries (3 attempts), heartbeat monitoring (120s), and crash recovery
- `execute_agent` activity calls `create_runner()` — all events, forms, locks work identically
- `koco temporal-worker` CLI command to run the worker process

Addresses issue #8 (scale spooler) — durability layer.
Test plan

- Default path (`KODO_EXECUTION_MODE=direct` or unset) → behaviour unchanged
- `KODO_EXECUTION_MODE=temporal` + start Temporal server → executions go through Temporal
- `koco temporal-worker` → worker connects and processes agent jobs
- `pytest tests/test_temporal.py -v` → all 16 tests pass