Skip to content

fix(foundation): prevent integer overflow in RetryPolicy::get_delay#1326

Open
kshirajahere wants to merge 1 commit intomofa-org:mainfrom
kshirajahere:fix/retry-policy-exponential-overflow
Open

fix(foundation): prevent integer overflow in RetryPolicy::get_delay#1326
kshirajahere wants to merge 1 commit intomofa-org:mainfrom
kshirajahere:fix/retry-policy-exponential-overflow

Conversation

@kshirajahere
Copy link
Contributor

@kshirajahere kshirajahere commented Mar 17, 2026

Summary

RetryPolicy::get_delay in mofa-foundation used bare u64::pow() and * for exponential backoff, causing integer overflow when retry_count is large.

  • Debug builds: arithmetic-overflow panic -> process crash
  • Release builds: silent wraparound -> delay collapses to 0 -> zero-delay retry loop -> unthrottled retry storm (self-DoS + downstream DoS)

Replace with checked_pow + checked_mul, saturating at max_delay_ms on overflow (which is the intentional cap anyway).

Related Issues

Closes #1325

Context

retry_count is a u32 parameter and max_retries in RetryPolicy is also u32. Any workflow config with max_retries >= 55 (combined with the default retry_delay_ms = 1000) will trigger the overflow in the exponential path. The existing unit test only exercised counts 0-10, leaving the overflow undetected.

Changes

crates/mofa-foundation/src/workflow/node.rs

  • RetryPolicy::get_delay: replace self.retry_delay_ms * 2u64.pow(retry_count) with checked_pow + checked_mul, unwrapping to max_delay_ms on overflow
  • Add test_retry_policy_large_count_does_not_overflow regression test covering counts 55, 63, 64, 65, 100, 1000, and u32::MAX

How you Tested

cargo test -p mofa-foundation test_retry_policy

Results:

  • test workflow::node::tests::test_retry_policy ... ok
  • test workflow::node::tests::test_retry_policy_large_count_does_not_overflow ... ok

Screenshots / Logs

running 5 tests
test recovery::tests::test_retry_policy_no_retry ... ok
test recovery::tests::test_retry_policy_default ... ok
test recovery::tests::test_retry_policy_builder ... ok
test workflow::node::tests::test_retry_policy ... ok
test workflow::node::tests::test_retry_policy_large_count_does_not_overflow ... ok
test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured

Breaking Changes

  • No breaking changes

The fix changes only the internal arithmetic. The public API, default values, and observable behaviour for normal retry counts (0-53 with default delay) are identical.

Checklist

Code Quality

  • Code follows Rust idioms and project conventions
  • cargo fmt run
  • cargo clippy passes without warnings

Testing

  • Tests added/updated
  • cargo test passes locally without any error

Documentation

  • Public APIs documented
  • README / docs updated (if needed)

PR Hygiene

  • PR is small and focused (one logical change)
  • Branch is up to date with main
  • No unrelated commits
  • Commit messages explain why, not only what

Deployment Notes

No migration required. Single crate, single function, no API change.

Additional Notes for Reviewers

The unwrap_or(self.max_delay_ms) default is intentional: if the arithmetic would overflow, the delay is already beyond the configured maximum, so returning max_delay_ms is both mathematically correct and safe. The subsequent .min(self.max_delay_ms) is kept as a belt-and-suspenders guard for the normal (non-overflow) path.

2u64.pow(retry_count) overflows for retry_count >= 64:
  - debug builds: arithmetic-overflow panic -> crash
  - release builds: wraps to 0 -> delay becomes 0 -> unthrottled
    retry storm that DOSes both the agent and the upstream service

Switch to checked_pow + checked_mul and saturate at max_delay_ms when
either operation would overflow. This is the correct behaviour anyway:
if 2^retry_count * base_ms exceeds the configured cap, use the cap.

Adds regression test covering retry_count values 55, 63, 64, 65, 100,
1000, and u32::MAX, all of which overflowed with the old code.
Copy link
Contributor Author

@kshirajahere kshirajahere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignore

@kshirajahere
Copy link
Contributor Author

Please ignore my previous one-word review comment (ignore). It was an accidental tooling artifact.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Integer overflow in RetryPolicy::get_delay causes panic (debug) or unthrottled retry storm (release)

1 participant