AI Agent Checkpoint and Resume

AI agents crash. State gets lost. AXME makes every agent operation durable by default - no checkpoint code, no framework lock-in.

Your agent crashes at step 47 of 50. LangGraph: write checkpoint code, configure PostgresSaver, manage state manually. CrewAI: start over from step 1. AXME: restart the agent. It picks up at step 47 automatically.

Alpha - Built with AXME (AXP Intent Protocol). cloud.axme.ai - contact@axme.ai

The Problem

AI agents do multi-step work. They crash. The state is gone.

Agent starts ETL pipeline:
  [1/4] Extract    - done
  [2/4] Validate   - done
  [3/4] Transform  - done
  [4/4] Load       - CRASH (OOM, network, restart)

What now?
  - LangGraph: did you configure PostgresSaver? No? Start over.
  - CrewAI: "limited state management, failures typically require restart"
  - Swarm: "no persistence, state exists only in memory"
  - Raw Python: hope you wrote checkpoint logic yourself

What breaks:

- State lives in memory - process dies, state dies
- Checkpoint is DIY - every framework has its own persistence mechanism (or none)
- No standard - LangGraph has SqliteSaver, CrewAI has nothing, Swarm has nothing
- Restart = start over - no way to resume from the last good step
- Cross-machine is impossible - checkpoints are local files, not network-accessible
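For a sense of what "DIY" means in the raw-Python case, here is a minimal sketch of hand-written checkpointing to a local JSON file. The `run_pipeline` function, the `fail_at` switch, and the file layout are all invented for illustration; they are not taken from any framework.

```python
import json
import os
import tempfile

STEPS = ["extract", "validate", "transform", "load"]


def run_pipeline(checkpoint_path, fail_at=None):
    """Run STEPS in order, skipping any step already recorded on disk."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["done"]

    for step in STEPS:
        if step in done:
            continue  # finished in an earlier run
        if step == fail_at:
            raise RuntimeError(f"crashed during {step}")
        done.append(step)
        # Write to a temp file and rename it, so a crash mid-write
        # cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"done": done}, f)
        os.replace(tmp_path, checkpoint_path)
    return done


checkpoint = os.path.join(tempfile.mkdtemp(), "etl-customers.json")
try:
    run_pipeline(checkpoint, fail_at="load")  # first run dies at step 4
except RuntimeError as exc:
    print(exc)
print(run_pipeline(checkpoint))  # second run only has "load" left to do
```

Even this small version has to handle atomic writes, and the checkpoint still lives on one machine's disk.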

The Solution: Durable Intent Lifecycle

Agent starts ETL pipeline via AXME intent:
  [1/4] Extract    - done (state in AXME)
  [2/4] Validate   - done (state in AXME)
  [3/4] Transform  - done (state in AXME)
  [4/4] Load       - CRASH

Restart agent:
  AXME redelivers the intent (state: IN_PROGRESS)
  Agent resumes. No data lost. No code changes.

State is in PostgreSQL, not in memory. The agent is stateless. The platform is stateful.
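The "stateless agent, stateful platform" split can be sketched locally. The example below uses an in-memory SQLite table as a toy stand-in for the platform's store; the `IntentStore` class, the table layout, and `run_agent` are assumptions made for illustration and say nothing about AXME's actual schema or internals.

```python
import sqlite3

STEPS = ["extract", "validate", "transform", "load"]


class IntentStore:
    """Toy stand-in for a durable platform store (not AXME's real schema)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS steps (intent_id TEXT, step TEXT, "
            "PRIMARY KEY (intent_id, step))"
        )

    def mark_done(self, intent_id, step):
        with self.db:  # commits when the block exits without error
            self.db.execute(
                "INSERT OR IGNORE INTO steps VALUES (?, ?)", (intent_id, step)
            )

    def done_steps(self, intent_id):
        rows = self.db.execute(
            "SELECT step FROM steps WHERE intent_id = ?", (intent_id,)
        )
        return {row[0] for row in rows}


def run_agent(store, intent_id, crash_at=None):
    """A stateless agent: everything it knows comes from the store."""
    executed = []
    finished = store.done_steps(intent_id)
    for step in STEPS:
        if step in finished:
            continue
        if step == crash_at:
            raise RuntimeError(f"agent died during {step}")
        executed.append(step)
        store.mark_done(intent_id, step)
    return executed


store = IntentStore()
try:
    run_agent(store, "intent-1", crash_at="load")
except RuntimeError:
    pass  # the agent is gone; the store still holds three finished steps
print(run_agent(store, "intent-1"))  # a fresh call runs only the remaining step
```

The agent function keeps nothing between calls, so it can be killed and restarted anywhere that can reach the store.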

Quick Start

pip install axme
export AXME_API_KEY="your-key"   # Get one: axme login

from axme import AxmeClient, AxmeClientConfig
import os

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

# Submit a multi-step pipeline
intent_id = client.send_intent({
    "intent_type": "intent.pipeline.process.v1",
    "to_agent": "agent://myorg/production/pipeline-agent",
    "payload": {
        "pipeline": "etl-customers",
        "steps": ["extract", "validate", "transform", "load"],
        "total_rows": 500000,
    },
})

print(f"Pipeline submitted: {intent_id}")

# Wait for completion. If agent crashes, intent stays in current state.
# Restart agent - AXME redelivers. No checkpoint code needed.
result = client.wait_for(intent_id)
print(f"Done: {result['status']}")
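If you want to exercise initiator logic without an API key, one option is to write it against anything that exposes `send_intent()` and `wait_for()` as used above. The sketch below is hypothetical: `run_pipeline_intent` and `FakeClient` are invented helpers, and the `"COMPLETED"` value returned by the fake is an assumption about what `result['status']` might hold.

```python
def run_pipeline_intent(client, pipeline, steps, to_agent):
    """Submit a pipeline intent and block until wait_for() returns.

    `client` is any object with send_intent() and wait_for() shaped like
    the Quick Start calls; this helper itself is hypothetical.
    """
    intent_id = client.send_intent({
        "intent_type": "intent.pipeline.process.v1",
        "to_agent": to_agent,
        "payload": {"pipeline": pipeline, "steps": list(steps)},
    })
    result = client.wait_for(intent_id)
    return intent_id, result["status"]


class FakeClient:
    """In-memory stand-in for AxmeClient, for local tests only."""

    def __init__(self):
        self.sent = {}

    def send_intent(self, intent):
        intent_id = f"intent-{len(self.sent) + 1}"
        self.sent[intent_id] = intent
        return intent_id

    def wait_for(self, intent_id):
        # A real client would block here; the fake completes immediately.
        return {"status": "COMPLETED", "intent": self.sent[intent_id]}


fake = FakeClient()
intent_id, status = run_pipeline_intent(
    fake,
    pipeline="etl-customers",
    steps=["extract", "validate", "transform", "load"],
    to_agent="agent://myorg/production/pipeline-agent",
)
print(intent_id, status)
```

Swapping `fake` for the real `client` from the Quick Start should need no other change, provided the real methods behave as shown there.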

Before / After

Before: DIY Checkpointing (100+ lines per framework)

# LangGraph: framework-specific checkpoint setup
from langgraph.checkpoint.postgres import PostgresSaver

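DB_URI = "postgresql://user:pass@localhost/checkpoints"

# In recent langgraph-checkpoint-postgres releases, from_conn_string
# is a context manager, so the saver is only valid inside the block.
with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # create the checkpoint tables on first use

    # Build graph with checkpointer attached (builder is your StateGraph)
    graph = builder.compile(checkpointer=checkpointer)

    # Resume from checkpoint (if exists)
    config = {"configurable": {"thread_id": job_id}}
    state = graph.get_state(config)
    if state and state.values:
        result = graph.invoke(None, config)  # resume
    else:
        result = graph.invoke(initial_state, config)  # start fresh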

# Plus: manage DB connection, handle schema migrations,
# clean up old checkpoints, handle serialization errors...

# CrewAI: no built-in persistence
# "limited state management, failures typically require restart"

# Swarm: no persistence at all
# "state exists only in memory"

After: AXME Durable Execution (zero checkpoint code)

intent_id = client.send_intent({
    "intent_type": "intent.pipeline.process.v1",
    "to_agent": "agent://myorg/production/pipeline-agent",
    "payload": {"pipeline": "etl-customers", "steps": ["extract", "validate", "transform", "load"]},
})
result = client.wait_for(intent_id)

No PostgresSaver. No SqliteSaver. No checkpoint DB. No schema migrations. No serialization.

The intent lifecycle is durable by default. Agent crashes, intent stays. Agent restarts, intent redelivers.

How Crash Recovery Works

Normal flow:
  CREATED -> SUBMITTED -> DELIVERED -> IN_PROGRESS -> COMPLETED

Agent crashes at IN_PROGRESS:
  CREATED -> SUBMITTED -> DELIVERED -> IN_PROGRESS -> (agent dies)

  Intent stays at IN_PROGRESS in AXME's PostgreSQL.
  No timer. No cron. Just durable state.

Agent restarts:
  Agent calls client.listen("pipeline-agent-demo")
  AXME redelivers the intent (max_delivery_attempts: 3)
  Agent picks up where it left off.
  IN_PROGRESS -> COMPLETED
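The lifecycle above can be modeled as a small state machine. This is a sketch built only from the five states shown in the "Normal flow" line; the real protocol may well define more states and transitions, and the `Intent` class here is invented for illustration.

```python
from enum import Enum


class Status(Enum):
    CREATED = "CREATED"
    SUBMITTED = "SUBMITTED"
    DELIVERED = "DELIVERED"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"


# Forward transitions taken from the "Normal flow" line.
NEXT = {
    Status.CREATED: Status.SUBMITTED,
    Status.SUBMITTED: Status.DELIVERED,
    Status.DELIVERED: Status.IN_PROGRESS,
    Status.IN_PROGRESS: Status.COMPLETED,
}


class Intent:
    def __init__(self):
        self.status = Status.CREATED
        self.history = [Status.CREATED]

    def advance(self):
        """Move to the next state, or fail if already in a final state."""
        if self.status not in NEXT:
            raise ValueError(f"{self.status.name} is a final state")
        self.status = NEXT[self.status]
        self.history.append(self.status)
        return self.status

    def agent_crashed(self):
        # A crash changes nothing: the stored status is left as-is.
        return self.status


intent = Intent()
for _ in range(3):
    intent.advance()              # CREATED -> ... -> IN_PROGRESS
print(intent.agent_crashed().name)  # status survives the crash
print(intent.advance().name)        # the restarted agent finishes the work
```

The point of the model is that a crash is not a transition: nothing has to run at crash time for the state to survive.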

Comparison With Framework Checkpointing

| | LangGraph | CrewAI | Swarm | AXME |
|---|---|---|---|---|
| Persistence | PostgresSaver (opt-in) | None | None | Default (built-in) |
| Checkpoint code | 20+ lines | N/A | N/A | 0 lines |
| DB setup | You manage | N/A | N/A | Managed (AXME Cloud) |
| Resume after crash | From last checkpoint | Start over | Start over | Automatic redelivery |
| Cross-machine | No (local state) | No | No | Yes (network state) |
| Framework lock-in | LangGraph only | CrewAI only | Swarm only | Any framework |

How It Works

+-----------+  send_intent()   +----------------+  deliver     +-----------+
|           | ---------------> |                | -----------> |           |
| Initiator |                  |   AXME Cloud   |              |   Agent   |
|           | <- wait_for() -- |   (platform)   | <- resume()  |           |
|           |  resumes when    |                |  with result |           |
|           |  agent completes | intent state   |              | processes |
|           |                  | in PostgreSQL  |  crash?      | steps     |
|           |                  | (durable)      |  redeliver!  |           |
+-----------+                  +----------------+              +-----------+

1. Initiator submits a pipeline intent with steps and data
2. AXME delivers to the agent via SSE stream
3. Agent processes steps and resumes with result
4. If agent crashes - intent stays at current state in PostgreSQL
5. Agent restarts - AXME redelivers the intent (up to 3 attempts)
6. Agent picks up and completes the pipeline
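The redelivery step can be simulated in a few lines. In this sketch, `deliver_with_retries` and `make_flaky_agent` are hypothetical helpers; the loop stands in for redelivery that in practice happens when the agent reconnects. What the platform does after the last attempt is not described here, so the sketch simply leaves the status at IN_PROGRESS.

```python
def deliver_with_retries(handler, intent, max_delivery_attempts=3):
    """Call handler(intent) until it returns or attempts run out.

    Returns (status, attempts_used).
    """
    for attempt in range(1, max_delivery_attempts + 1):
        try:
            handler(intent)
        except Exception:
            continue  # agent died; the intent is untouched and gets redelivered
        return "COMPLETED", attempt
    # Assumption: the status is left unchanged once attempts are exhausted.
    return "IN_PROGRESS", max_delivery_attempts


def make_flaky_agent(crashes):
    """Build a handler that crashes on 'load' the first `crashes` times."""
    remaining = {"crashes": crashes}

    def handler(intent):
        for step in intent["steps"]:
            if step in intent["done"]:
                continue  # already finished before an earlier crash
            if step == "load" and remaining["crashes"] > 0:
                remaining["crashes"] -= 1
                raise RuntimeError("OOM during load")
            intent["done"].append(step)

    return handler


intent = {"steps": ["extract", "validate", "transform", "load"], "done": []}
print(deliver_with_retries(make_flaky_agent(crashes=1), intent))
print(intent["done"])
```

The handler skips steps already listed in `done`, so a redelivered intent does not repeat finished work. Whether your own agent needs that kind of idempotency depends on how it records progress.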

Works With Any Agent Framework

AXME durability works underneath any framework. No framework changes needed.

| Framework | How AXME Adds Durability |
|---|---|
| LangGraph | Replace PostgresSaver with AXME intent lifecycle |
| CrewAI | Add durability where CrewAI has none |
| AutoGen | Durable multi-agent conversations across restarts |
| OpenAI Agents SDK | Survive crashes without losing agent state |
| Any Python code | `send_intent()` + `wait_for()` = durable by default |

Run the Full Example

Prerequisites

# Install CLI (one-time)
curl -fsSL https://raw.githubusercontent.com/AxmeAI/axme-cli/main/install.sh | sh
# Open a new terminal, or run the "source" command shown by the installer

# Log in
axme login

# Install Python SDK
pip install axme

Terminal 1 - submit the intent

axme scenarios apply scenario.json
# Note the intent_id in the output

Terminal 2 - start the agent

Get the agent key after scenario apply:

# macOS
cat ~/Library/Application\ Support/axme/scenario-agents.json | grep -A2 pipeline-agent-demo

# Linux
cat ~/.config/axme/scenario-agents.json | grep -A2 pipeline-agent-demo
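If you would rather do the lookup from Python, the two shell commands above can be approximated as follows. This is a sketch: `scenario_agents_path` and `grep_after` are invented helpers, the file is treated as plain text because its JSON layout is not documented here, and any platform other than macOS is assumed to use the Linux path.

```python
import sys
import tempfile
from pathlib import Path


def scenario_agents_path(platform=None):
    """Return the scenario-agents.json location for the two platforms above."""
    platform = platform or sys.platform
    if platform == "darwin":
        base = Path.home() / "Library" / "Application Support" / "axme"
    else:
        # Assumption: everything that is not macOS uses the Linux path.
        base = Path.home() / ".config" / "axme"
    return base / "scenario-agents.json"


def grep_after(path, needle, after=2):
    """Return each line containing needle plus the next `after` lines."""
    lines = Path(path).read_text().splitlines()
    matches = []
    for i, line in enumerate(lines):
        if needle in line:
            matches.extend(lines[i : i + after + 1])
    return matches


# Demo on a throwaway file so the sketch runs without the CLI installed.
demo = Path(tempfile.mkdtemp()) / "demo.txt"
demo.write_text("a\npipeline-agent-demo\nb\nc\nd\n")
print(grep_after(demo, "pipeline-agent-demo"))
```

After `axme scenarios apply`, calling `grep_after(scenario_agents_path(), "pipeline-agent-demo")` should show roughly what the `grep -A2` commands print.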

Run the agent:

AXME_API_KEY=<agent-key> python agent.py

Try crash recovery

1. While the agent is processing, press Ctrl+C to kill it
2. Check intent status: `axme intents get <intent_id>` (still IN_PROGRESS)
3. Restart the agent: `AXME_API_KEY=<agent-key> python agent.py`
4. Agent picks up the intent and completes it

Verify

axme intents get <intent_id>
# lifecycle_status: COMPLETED

Related

- AXME - project overview
- AXP Spec - open Intent Protocol specification
- AXME Examples - 20+ runnable examples across 5 languages
- AXME CLI - manage intents, agents, scenarios from the terminal
- Async Human Approval for AI Agents - async approval with reminders
- Durable Execution with Human Approval - what Temporal can't do

Built with AXME (AXP Intent Protocol).
