Scheduler: real broker matchmaking, lease management, and job lifecycle #55

@jeremymanning

Description

The scheduler module has structural definitions (job/task/workflow state machines, manifest validation, priority scoring), but the runtime broker logic is not yet wired for real multi-node operation. The following pieces are missing:

  • ClassAd-style bilateral matchmaking (task requirements ↔ agent capabilities)
  • Lease issuance, renewal via heartbeat, and expiry handling
  • Speculative execution and lineage tracking
  • R=3 replica placement with disjoint-AS (autonomous system) enforcement
  • Checkpoint commit flow through data plane
  • Regional broker election and failover
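To make the bilateral matchmaking item concrete, here is a minimal sketch of a two-sided match check. All type and function names (`AgentProfile`, `TaskRequirements`, `is_match`, `best_match`) and the specific fields are hypothetical and not taken from the existing scheduler module. The best-fit ranking is also an assumption; the real broker may rank by priority score instead.

```rust
use std::collections::HashSet;

/// Hypothetical capability profile advertised by an agent.
#[derive(Debug, Clone)]
pub struct AgentProfile {
    pub id: String,
    pub cpu_cores: u32,
    pub memory_mb: u64,
    pub has_gpu: bool,
    /// Task kinds the agent is willing to run (the agent's side of the match).
    pub accepted_kinds: HashSet<String>,
}

/// Hypothetical requirements declared by a task.
#[derive(Debug, Clone)]
pub struct TaskRequirements {
    pub kind: String,
    pub min_cpu_cores: u32,
    pub min_memory_mb: u64,
    pub needs_gpu: bool,
}

/// Bilateral check: the task must accept the agent AND the agent must accept the task.
pub fn is_match(task: &TaskRequirements, agent: &AgentProfile) -> bool {
    let task_accepts_agent = agent.cpu_cores >= task.min_cpu_cores
        && agent.memory_mb >= task.min_memory_mb
        && (!task.needs_gpu || agent.has_gpu);
    let agent_accepts_task = agent.accepted_kinds.contains(&task.kind);
    task_accepts_agent && agent_accepts_task
}

/// Among matching agents, pick the smallest one that fits (best fit).
/// The ranking rule is an illustrative assumption, not a project decision.
pub fn best_match<'a>(
    task: &TaskRequirements,
    agents: &'a [AgentProfile],
) -> Option<&'a AgentProfile> {
    agents
        .iter()
        .filter(|a| is_match(task, a))
        .min_by_key(|a| (a.cpu_cores, a.memory_mb))
}
```

With this shape, a GPU task never matches a CPU-only agent, and an agent can refuse task kinds it does not list, which is the two-sided property that distinguishes ClassAd-style matching from a one-way capability filter.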

Requirements

  • Broker matches tasks to agents based on capability profiles
  • Leases issued with configurable TTL, renewed on heartbeat
  • Expired leases trigger rescheduling
  • R=3 replicas placed on disjoint autonomous systems
  • Speculative execution for latency-sensitive tasks
  • Checkpoint flow: sandbox → CID store → erasure coding → placement
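The lease requirements above (configurable TTL, renewal on heartbeat, rescheduling on expiry) could be sketched roughly as follows. `LeaseTable` and its methods are hypothetical names, not existing APIs in this repository. The sketch assumes one lease per task and takes the current time as a parameter so that expiry can be tested deterministically.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical lease record: which agent holds the task and until when.
#[derive(Debug, Clone)]
pub struct Lease {
    pub agent_id: String,
    pub expires_at: Instant,
}

/// Hypothetical in-memory lease table, keyed by task id.
pub struct LeaseTable {
    ttl: Duration,
    leases: HashMap<String, Lease>,
}

impl LeaseTable {
    pub fn new(ttl: Duration) -> Self {
        LeaseTable { ttl, leases: HashMap::new() }
    }

    /// Issue (or replace) the lease for a task, valid for one TTL from `now`.
    pub fn issue(&mut self, task_id: &str, agent_id: &str, now: Instant) {
        let lease = Lease { agent_id: agent_id.to_string(), expires_at: now + self.ttl };
        self.leases.insert(task_id.to_string(), lease);
    }

    /// Renew on heartbeat. Returns false if the lease is missing, already
    /// expired, or held by a different agent.
    pub fn heartbeat(&mut self, task_id: &str, agent_id: &str, now: Instant) -> bool {
        let ttl = self.ttl;
        match self.leases.get_mut(task_id) {
            Some(lease) if lease.agent_id == agent_id && lease.expires_at > now => {
                lease.expires_at = now + ttl;
                true
            }
            _ => false,
        }
    }

    /// Drop expired leases and return the task ids that need rescheduling.
    pub fn reap_expired(&mut self, now: Instant) -> Vec<String> {
        let expired: Vec<String> = self
            .leases
            .iter()
            .filter(|(_, lease)| lease.expires_at <= now)
            .map(|(task_id, _)| task_id.clone())
            .collect();
        for task_id in &expired {
            self.leases.remove(task_id);
        }
        expired
    }
}
```

A broker loop would presumably call `reap_expired` on a timer and feed the returned task ids back into matchmaking. Whether rescheduling resumes from the last checkpoint is outside this sketch.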

Success Criteria

  • Broker matches tasks to capable agents
  • Leases issued, renewed, and expired correctly
  • R=3 replicas on disjoint AS
  • Checkpoint commit flow works end-to-end
  • Integration test: multi-node job lifecycle (submit → schedule → execute → verify → complete)
  • cargo test passes

Testing (Principle V)

  • Multi-node cluster → submit job → verify broker matches to capable node
  • Kill executor mid-task → verify rescheduling from checkpoint
  • Submit job requiring GPU → verify matched only to GPU nodes
  • Verify R=3 placement uses nodes in disjoint autonomous systems
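For the R=3 placement test, one possible shape is a greedy selection that takes at most one node per autonomous system, paired with a check the test can assert on. `Candidate`, `place_replicas`, and `is_as_disjoint` are hypothetical names, and representing an AS by a bare `u32` number is an assumption. Candidate ordering (for example by priority score) is assumed to happen before the call.

```rust
use std::collections::HashSet;

/// Replication factor from the requirements (R=3).
pub const REPLICATION_FACTOR: usize = 3;

/// Hypothetical placement candidate: a node and the AS number it lives in.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Candidate {
    pub node_id: String,
    pub asn: u32,
}

/// Greedy selection: walk candidates in order and keep the first node seen
/// in each AS until `r` replicas are chosen. Returns None when fewer than
/// `r` distinct ASes are available, so the caller can fail or retry.
pub fn place_replicas(candidates: &[Candidate], r: usize) -> Option<Vec<&Candidate>> {
    let mut seen = HashSet::new();
    let mut chosen = Vec::with_capacity(r);
    for c in candidates {
        if chosen.len() == r {
            break;
        }
        if seen.insert(c.asn) {
            chosen.push(c);
        }
    }
    if chosen.len() == r { Some(chosen) } else { None }
}

/// Check a test can assert on: no two replicas share an AS.
pub fn is_as_disjoint(placement: &[&Candidate]) -> bool {
    let mut seen = HashSet::new();
    placement.iter().all(|c| seen.insert(c.asn))
}
```

Returning `None` instead of a partial placement keeps the disjointness guarantee strict. A real broker might prefer to place what it can and repair later, which would be a different design.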
