Churn simulator: validate 80% completion at 30% churn over 72h #51

@jeremymanning

Description

Per spec T144c and whitepaper Phase 2 requirement: "Over a 72-hour continuous run, 80% of submitted test jobs complete correctly with 30% simulated node churn."

Requirements

  • Build configurable churn simulator: random node kill/rejoin at configurable rate
  • Integrate with multi-node test harness
  • Track job completion rates under churn
  • Validate checkpoint/resume works across node failures
  • Run for configurable duration (target: 72 hours for Phase 2)
  • Report completion rate, data loss events, and recovery metrics
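As a rough illustration of the first requirement, here is a minimal in-memory sketch of a kill/rejoin loop. It is not the project's simulator: the class name `ChurnSimulator`, the tick-based model, and the reading of "churn rate" as the per-tick probability that a live node is killed are all assumptions, since the issue does not define how the 30% figure is measured. A real implementation would kill and restart actual node processes through the multi-node test harness.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ChurnSimulator:
    """Toy churn model (hypothetical; node ids and tick semantics are assumptions)."""

    node_ids: list
    churn_rate: float        # assumed meaning: P(live node is killed) per tick
    rejoin_ticks: int = 1    # assumed: ticks a killed node stays down
    seed: int = 0            # fixed seed so a run can be reproduced
    down: dict = field(default_factory=dict)  # node id -> ticks until rejoin

    def __post_init__(self):
        if not 0.0 <= self.churn_rate <= 1.0:
            raise ValueError("churn_rate must be in [0, 1]")
        self.rng = random.Random(self.seed)

    def tick(self):
        """Advance one step and return (killed, rejoined) node ids."""
        rejoined = []
        for node in list(self.down):
            self.down[node] -= 1
            if self.down[node] <= 0:
                del self.down[node]
                rejoined.append(node)

        killed = []
        for node in self.node_ids:
            # Nodes that are down, or that rejoined this tick, are not eligible.
            if node in self.down or node in rejoined:
                continue
            if self.rng.random() < self.churn_rate:
                self.down[node] = self.rejoin_ticks
                killed.append(node)
        return killed, rejoined

    def live_nodes(self):
        """Return the ids of nodes that are currently up."""
        return [n for n in self.node_ids if n not in self.down]
```

A harness could call `tick()` on a timer, pass the `killed` list to whatever mechanism stops nodes, and pass `rejoined` to whatever restarts them. Keeping the random source seeded makes a failing 72-hour run easier to replay.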

Success Criteria

  • Simulator can kill/rejoin nodes at configurable rate (target: 30% churn)
  • Job completion rate ≥ 80% at 30% churn over 72 hours
  • Zero data loss events during churn
  • Checkpoint/resume works correctly across node failures
  • Metrics reported: completion rate, recovery time, data loss count
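The numeric criteria above could be checked mechanically at the end of a run. The sketch below assumes a hypothetical `RunMetrics` record; its field names and the function `meets_phase2_criteria` are illustrative and do not come from the spec. The checkpoint/resume criterion is left out because it needs its own correctness test rather than a threshold.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    """Hypothetical summary of one churn run; field names are assumptions."""

    jobs_submitted: int
    jobs_completed_correctly: int
    data_loss_events: int
    churn_rate: float       # observed churn, as a fraction (0.30 for 30%)
    duration_hours: float


def meets_phase2_criteria(m, min_completion=0.80, target_churn=0.30, min_hours=72.0):
    """Return (completion_rate, per-criterion results) for one run."""
    if m.jobs_submitted <= 0:
        raise ValueError("no jobs were submitted")
    completion = m.jobs_completed_correctly / m.jobs_submitted
    checks = {
        "completion_rate": completion >= min_completion,
        "zero_data_loss": m.data_loss_events == 0,
        "churn_at_target": m.churn_rate >= target_churn,
        "duration": m.duration_hours >= min_hours,
    }
    return completion, checks
```

Returning each check separately, rather than a single boolean, lets the report show which criterion failed.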

Testing (Principle V)

  • Run on 20+ node testbed with real hardware
  • At 30% churn rate, verify job completion rate ≥ 80%
  • Verify zero data loss during churn
  • Measure checkpoint/resume latency under various churn rates
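For the latency measurement, one possible way to summarize results is to group resume-latency samples by the churn rate they were collected under. This sketch assumes samples arrive as `(churn_rate, latency_seconds)` pairs; that format and the function name `summarize_resume_latency` are hypothetical, and the statistics shown (count, median, maximum) are only one reasonable choice.

```python
import statistics
from collections import defaultdict


def summarize_resume_latency(samples):
    """Group (churn_rate, latency_seconds) pairs and summarize each group."""
    by_rate = defaultdict(list)
    for rate, latency in samples:
        by_rate[rate].append(latency)
    return {
        rate: {
            "count": len(values),
            "median_s": statistics.median(values),
            "max_s": max(values),
        }
        for rate, values in sorted(by_rate.items())
    }
```

Reporting the maximum alongside the median helps surface slow recoveries that an average would hide.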
