Skip to content

fix(rhai-workflow): prevent overflow-driven retry storm in scripted nodes#1336

Open
kshirajahere wants to merge 1 commit intomofa-org:mainfrom
kshirajahere:fix/rhai-workflow-timeout-and-backoff
Open

fix(rhai-workflow): prevent overflow-driven retry storm in scripted nodes#1336
kshirajahere wants to merge 1 commit intomofa-org:mainfrom
kshirajahere:fix/rhai-workflow-timeout-and-backoff

Conversation

@kshirajahere
Copy link
Contributor

Summary

Fix a critical overflow in Rhai scripted workflow retry backoff used by agentic pipeline nodes.

ScriptWorkflowNode::execute previously used unchecked arithmetic:

  • 100 * 2u64.pow(retry_count)

For large retry counts, this overflows:

  • Debug builds: panic/crash
  • Release builds: wraparound, often to �ms delay

A �ms delay under repeated failure creates an unthrottled retry storm that can starve worker loops and overload downstream dependencies.

Related Issues

Closes #1329

Context

This directly impacts reliability of scripted agentic design patterns (conditional branches, transform chains, validation loops) built on ScriptWorkflowNode. Robust backoff behavior is foundational for iterative framework work under the GSoC “classic agentic design patterns” scope.

Changes

  • Add MAX_SCRIPT_NODE_BACKOFF_MS safety cap (60_000ms)
  • Add helper script_node_retry_delay_ms(retry_count) with checked arithmetic
  • Replace unsafe retry delay computation in ScriptWorkflowNode::execute
  • Add regression test: est_script_node_retry_delay_large_count_is_capped

How you Tested

�ash cargo test -p mofa-extra script_node_retry_delay_large_count_is_capped -- --nocapture cargo test -p mofa-extra test_simple_workflow_execution -- --nocapture cargo test -p mofa-extra test_conditional_workflow -- --nocapture

All passed.

Screenshots / Logs (if applicable)

Key output:

  • est rhai::workflow::tests::test_script_node_retry_delay_large_count_is_capped ... ok
  • est rhai::workflow::tests::test_simple_workflow_execution ... ok
  • est rhai::workflow::tests::test_conditional_workflow ... ok

⚠️ Breaking Changes

  • No breaking changes
  • Breaking change (describe below)

🧹 Checklist

Code Quality

  • Code follows Rust idioms and project conventions
  • cargo fmt run
  • cargo clippy passes without warnings

Testing

  • Tests added/updated
  • cargo test passes locally without any error

Documentation

  • Public APIs documented
  • README / docs updated (if needed)

PR Hygiene

  • PR is small and focused (one logical change)
  • Branch is up to date with main
  • No unrelated commits
  • Commit messages explain why, not only what

Deployment Notes (if applicable)

No migration required.

Additional Notes for Reviewers

This patch intentionally keeps existing retry semantics but removes overflow-driven undefined behavior by enforcing checked arithmetic + hard cap.

In ScriptWorkflowNode::execute, retry delay used 100 * 2u64.pow(retry_count).
For large retry_count values this overflows:
- debug builds: panic (arithmetic overflow)
- release builds: wraparound, often to 0ms, producing an unthrottled retry storm

This is especially dangerous for scripted agentic patterns where max_retries may
be configured dynamically: a single failing scripted node can hammer the runtime
loop and downstream dependencies.

Use checked arithmetic and clamp to a safe global cap (60s), then add a
regression test covering retry_count values that previously overflowed
(55, 63, 64, 65, 1000, u32::MAX).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] P1: Rhai scripted workflow retry backoff overflows to 0ms, causing retry storms in agentic pipelines

1 participant