Facio's Checkpoint Discipline: How Periodic State Snapshots Let AI Agents Resume After Crashes and Budget Exhaustion
An AI agent that loses its work to a crash, an exhausted budget, or a network failure has to start over from scratch. The hours of research, the partially completed deployment, the half-written report, the carefully curated memory updates: all gone. The next session begins with no knowledge of the previous one's progress. The agent redoes the work, makes the same mistakes, re-fetches the same sources, and the cycle repeats.
For short tasks — "summarize this article," "look up this fact" — the cost of starting over is low. For long-running work — multi-stage deployments, comprehensive research, complex analyses, iterative refactors — the cost is devastating. The user gave the agent 90 minutes of work; a crash at minute 85 wastes the entire investment.
Facio's checkpoint discipline turns sessions into resumable workflows. Because state snapshots are periodically persisted to the workspace, the agent can resume from the most recent checkpoint after an interruption. Work done up to that checkpoint isn't lost; it's paused. The next session picks up where the checkpoint left off, with a record of what was done, what's pending, and what comes next.
Here's how checkpoints work, when the agent takes them, and why checkpoint-aware agents finish work that checkpoint-blind agents abandon.
The Checkpoint Model
A checkpoint is a structured snapshot of the agent's session state — a markdown file (or JSON, depending on complexity) that captures everything needed to resume the work:
# Checkpoint: 2026-06-20 09:47 UTC
## Mission
Comprehensive security review of authentication flow in projects/api-server/
## Progress
- [x] Codebase mapped (28 files analyzed)
- [x] Authentication flow traced (8 endpoints)
- [x] Threat model drafted (5 attack vectors identified)
- [x] Vulnerability findings recorded (3 critical, 5 high)
- [ ] Remediation recommendations (in progress, 3 of 8 drafted)
- [ ] Executive summary (pending)
- [ ] Final report (pending)
## Current State
- Working on remediation recommendation #4 (token rotation policy)
- Last file read: projects/api-server/src/auth/tokens.py
- Last tool call: edit_file to add note about JWT expiration
## Open Questions for Human
- Is the 30-day token rotation policy acceptable, or should we recommend 7-day?
- Do we need to consider legacy clients that don't support refresh tokens?
## Artifacts Produced
- output/security-review/threat-model.md (draft)
- output/security-review/findings.md (draft)
- output/security-review/remediation.md (partial)
- memory/MEMORY.md (updated with security context)
## Resumption Point
On resume: continue remediation recommendations at item #4, then complete executive summary and final report.
## Iteration Stats
- Iterations used: 38
- Budget remaining: 12
- Session duration: 47 minutes
The checkpoint is a self-contained resume point. The next session reads this file and continues from the point the checkpoint recorded.
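As a rough illustration, the sample checkpoint above could be produced from a small in-memory structure. This is a minimal sketch, not Facio's actual format specification: the `Checkpoint` class, its field names, and `to_markdown` are hypothetical and only mirror the section headings shown in the sample.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Hypothetical in-memory form of the sample checkpoint file (not a Facio API)."""
    timestamp: str
    mission: str
    progress: list[tuple[bool, str]]  # (done, description) pairs
    current_state: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    artifacts: list[str] = field(default_factory=list)
    resumption_point: str = ""
    iterations_used: int = 0
    budget_remaining: int = 0

    def to_markdown(self) -> str:
        """Render the checkpoint using the section headings from the sample."""
        def bullets(items: list[str]) -> list[str]:
            return [f"- {item}" for item in items]

        lines = [f"# Checkpoint: {self.timestamp}", "## Mission", self.mission]
        lines.append("## Progress")
        lines += [f"- [{'x' if done else ' '}] {text}" for done, text in self.progress]
        lines += ["## Current State", *bullets(self.current_state)]
        lines += ["## Open Questions for Human", *bullets(self.open_questions)]
        lines += ["## Artifacts Produced", *bullets(self.artifacts)]
        lines += ["## Resumption Point", self.resumption_point]
        lines += [
            "## Iteration Stats",
            f"- Iterations used: {self.iterations_used}",
            f"- Budget remaining: {self.budget_remaining}",
        ]
        return "\n".join(lines) + "\n"
```

Keeping the structure separate from the rendering means the same data could also be written as JSON when a task is too complex for a readable markdown file.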
When Checkpoints Happen
Facio's checkpoint discipline has four trigger types:
Trigger 1: Budget Threshold
When iteration usage crosses a configured threshold (by default 75% and 90% of the budget), the agent writes a checkpoint. The checkpoint captures the work to date so the agent can resume in the next session, even if the current session exhausts the budget.
# Iteration Budget: 12 remaining (24% of original 50)
# Checkpoint triggered (75% threshold crossed)
write_file(path="tmp/checkpoint.md", content=...)
The threshold trigger is the most common checkpoint. It's the safety net for "I might run out of budget before I finish."
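The threshold check itself is simple arithmetic. The sketch below assumes thresholds are expressed as whole percentages of the budget that has been used; the function name and signature are hypothetical, not part of Facio's runtime.

```python
def crossed_thresholds(used_before: int, used_after: int, budget: int,
                       thresholds: tuple[int, ...] = (75, 90)) -> list[int]:
    """Return the percent thresholds crossed when usage moves from
    used_before to used_after iterations (hypothetical helper).

    Integer arithmetic avoids floating-point edge cases at exact boundaries.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    return [
        t for t in thresholds
        if used_before * 100 < t * budget <= used_after * 100
    ]
```

With a budget of 50, moving from 37 to 38 iterations crosses the 75% mark (37.5 iterations), which matches the "12 remaining" example above. Each threshold fires only once because the check compares usage before and after the step.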
Trigger 2: After Major Milestones
When the agent completes a significant phase of work — finishes the research, completes the analysis, ships a major component — the agent writes a checkpoint to preserve the milestone. The next session starts with "research is done, now we move to analysis" rather than "where are we?"
Milestone checkpoints are deliberate. The agent identifies a natural pause point and captures it.
Trigger 3: Before Destructive Operations
Before any operation that could fail partway through (a multi-step deployment, a database migration, a code refactor), the agent writes a checkpoint. If the operation fails, the agent can roll back to the checkpoint state rather than the empty starting state.
# Before deploying:
write_file(path="tmp/checkpoint-pre-deploy.md", content=...)
# Run the deployment
exec(command="kubectl apply -f deployment.yaml")
# If success: write post-deploy checkpoint
# If failure: rollback to pre-deploy checkpoint
The destructive-operation trigger is the rollback safety mechanism. The agent doesn't proceed without a recovery point.
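One way to enforce "no recovery point, no operation" is to wrap the risky step in a helper. This is a sketch under assumptions: `run_with_recovery_point` and the `checkpoint-pre-…`, `checkpoint-post-…`, and `checkpoint-failed-…` file names are hypothetical, and the operation is passed as a plain callable so the example stays self-contained.

```python
from pathlib import Path
from typing import Callable


def run_with_recovery_point(operation: Callable[[], None],
                            tmp_dir: Path,
                            name: str,
                            state_summary: str) -> bool:
    """Write a pre-operation checkpoint, run the operation, record the outcome.

    Returns True on success. On failure the pre-operation checkpoint stays in
    place as the recovery point. Undoing side effects in external systems is
    NOT handled here; the checkpoint only records what to restore.
    """
    tmp_dir.mkdir(parents=True, exist_ok=True)
    pre = tmp_dir / f"checkpoint-pre-{name}.md"
    pre.write_text(f"# Checkpoint: before {name}\n{state_summary}\n", encoding="utf-8")

    try:
        operation()
    except Exception as exc:
        # Record the failure and point at the recovery checkpoint.
        failure = tmp_dir / f"checkpoint-failed-{name}.md"
        failure.write_text(
            f"# Checkpoint: {name} failed\n"
            f"- Error: {exc!r}\n"
            f"- Recovery point: {pre.name}\n",
            encoding="utf-8",
        )
        return False

    post = tmp_dir / f"checkpoint-post-{name}.md"
    post.write_text(f"# Checkpoint: after {name}\n- Completed: {name}\n", encoding="utf-8")
    return True
```

The pre-operation file is written before the callable runs, so even a hard crash in the middle of the operation leaves the recovery point on disk.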
Trigger 4: On Termination Signal
When the session is about to terminate (budget exhausted, user cancels, system shutdown), the runtime gives the agent a final chance to write a checkpoint. The termination checkpoint captures everything that was done, what's pending, and the resumption point.
The termination checkpoint is the last-resort capture. The runtime ensures the agent can write a final state before the session ends.
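From the runtime's side, a "final chance" can be approximated with a `finally` block around the session loop. The sketch below is an assumption about how such a hook might look, not Facio's implementation; `termination_checkpoint` and its `render` callback are hypothetical names.

```python
from contextlib import contextmanager
from pathlib import Path
from typing import Callable, Iterator


@contextmanager
def termination_checkpoint(path: Path, render: Callable[[], str]) -> Iterator[None]:
    """Write a final checkpoint when the wrapped session ends for any reason.

    The finally block runs on normal completion, on exceptions, and on
    KeyboardInterrupt. It cannot run after a hard kill or power loss, which is
    why the earlier threshold and milestone checkpoints still matter.
    """
    try:
        yield
    finally:
        path.parent.mkdir(parents=True, exist_ok=True)
        # render() is called at exit time so it sees the latest session state.
        path.write_text(render(), encoding="utf-8")
```

Because `render` is evaluated only when the block exits, the checkpoint reflects the state at termination rather than the state when the session started.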
What Goes in a Checkpoint
A Facio checkpoint is structured to be both human-readable and agent-actionable. The standard sections:
Mission
The original task statement, captured verbatim from the user's prompt or the task description. The next session needs to know what the user actually asked for.
Progress
A structured checklist of what was done and what's pending. The agent uses the checklist to resume work — the next iteration looks at the checklist, finds the first unchecked item, and continues.
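Finding the first unchecked item is easy to mechanize because the checklist uses the `- [x]` / `- [ ]` convention from the sample checkpoint. A minimal sketch, with hypothetical helper names:

```python
import re
from typing import Optional

# Matches "- [x] text" or "- [ ] text", as in the sample checkpoint's Progress section.
CHECKLIST_ITEM = re.compile(r"^- \[([ xX])\] (.+)$")


def parse_progress(markdown: str) -> list[tuple[bool, str]]:
    """Return (done, description) pairs for every checklist line in the text."""
    items = []
    for line in markdown.splitlines():
        match = CHECKLIST_ITEM.match(line.strip())
        if match:
            items.append((match.group(1).lower() == "x", match.group(2)))
    return items


def next_pending(markdown: str) -> Optional[str]:
    """Return the first unchecked item, or None when everything is done."""
    for done, text in parse_progress(markdown):
        if not done:
            return text
    return None
```

Applied to the sample checkpoint, `next_pending` would return the remediation recommendations item, which is the same place the Resumption Point section tells the agent to continue.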
Current State
The immediate context the agent was working in: which file was being edited, which step of a process was running, which sub-task was active. This is the most operationally specific information; it's what the agent needs to "pick up the thread."
Open Questions
Anything the agent is blocked on or uncertain about. These are the questions for the human or for further research. The next session can resolve them or escalate them.
Artifacts Produced
A list of every file written, every memory entry added, every significant output created during the session. The next session knows what already exists and doesn't need to recreate it.
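On resume it can be worth confirming that the listed artifacts still exist before relying on them. The helper below is a hypothetical sketch; it assumes each entry is a workspace-relative path optionally followed by a parenthesized status such as "(draft)", as in the sample checkpoint.

```python
from pathlib import Path


def missing_artifacts(workspace: Path, artifact_lines: list[str]) -> list[str]:
    """Return the artifact paths from a checkpoint that are absent on disk.

    Assumes entries look like "output/report.md (draft)"; the trailing
    parenthesized status is stripped before checking. Paths that themselves
    contain " (" would need a stricter format.
    """
    missing = []
    for line in artifact_lines:
        rel_path = line.split(" (", 1)[0].strip()
        if not (workspace / rel_path).exists():
            missing.append(rel_path)
    return missing
```

An empty result means the next session can build on the existing files; a non-empty one signals that part of the recorded work has to be recreated.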
Resumption Point
A clear, specific instruction for what to do next: "Continue remediation recommendations at item #4, then complete executive summary and final report." The instruction is concrete enough that the next agent can start immediately.
Iteration Stats
The session's resource usage: iterations consumed, budget remaining, wall-clock duration. This information is useful for diagnosing patterns ("security reviews keep reaching iteration 38 with remediation unfinished, so maybe the budget should be higher").
How the Next Session Resumes
When a new session starts, the agent's first action is to check for an existing checkpoint:
# Session start
glob(pattern="tmp/checkpoint*.md")
# If found: read the most recent checkpoint
read_file(path="tmp/checkpoint-2026-06-20-0947.md")
# If not found: start fresh
If a checkpoint exists, the agent reads it and continues:
# Resumed from checkpoint
# Mission: Comprehensive security review of authentication flow
# Progress: 4 of 7 milestones complete
# Next: Remediation recommendations (item #4)
# Continue from resumption point
edit_file(path="output/security-review/remediation.md", ...)
The user doesn't have to do anything to trigger resumption. The agent detects the checkpoint, loads the context, and continues. The user experience is "I came back the next day and the work was further along than I expected" — without the user having to brief the agent on the previous session.
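The "most recent checkpoint" lookup from the session-start steps can be sketched with the standard library. This assumes recency is judged by file modification time and that checkpoints match `tmp/checkpoint*.md`; `latest_checkpoint` is a hypothetical name, not a Facio tool.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(workspace: Path) -> Optional[Path]:
    """Return the most recently modified tmp/checkpoint*.md, or None to start fresh."""
    tmp_dir = workspace / "tmp"
    if not tmp_dir.is_dir():
        return None
    candidates = list(tmp_dir.glob("checkpoint*.md"))
    if not candidates:
        return None
    # Modification time is an assumption; a timestamp in the file name
    # or in the checkpoint header would work as well.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Returning `None` rather than raising keeps the "start fresh" path as ordinary control flow at session start.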
The Checkpoint File Lifecycle
Checkpoints have a lifecycle. They're written, used, and eventually retired:
Phase 1: Active Checkpoint
When a checkpoint is the most recent resumption point, it's "active." The next session reads it. The agent continues from it.
Phase 2: Superseded Checkpoint
When a new checkpoint is written (a new milestone completed, a destructive operation succeeded), the previous checkpoint becomes "superseded." It's still in the workspace, but it's no longer the active resumption point.
Phase 3: Archived Checkpoint
When the overall work is complete, the agent archives the checkpoint chain to output/<project>/checkpoints/ and removes the active checkpoint from tmp/. The archived checkpoints are still recoverable via recall if needed for retrospective analysis.
Phase 4: Reflected Learnings
Reflection reviews completed checkpoint chains periodically. Patterns emerge: "we always checkpoint at the same points," "we never resume from checkpoints older than 24 hours," "destructive operations followed by failure often need manual rollback." These patterns are consolidated into heuristics for future sessions.
The lifecycle ensures checkpoints don't accumulate as cruft while preserving the historical record.
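The archive step in Phase 3 amounts to moving files from `tmp/` into `output/<project>/checkpoints/`. A minimal sketch, assuming checkpoints match `checkpoint*.md`; the function name is hypothetical.

```python
import shutil
from pathlib import Path


def archive_checkpoints(workspace: Path, project: str) -> list[Path]:
    """Move tmp/checkpoint*.md into output/<project>/checkpoints/.

    Returns the archived paths. Other files in tmp/ are left untouched.
    Name collisions in the archive are not handled in this sketch.
    """
    archive_dir = workspace / "output" / project / "checkpoints"
    archive_dir.mkdir(parents=True, exist_ok=True)
    archived = []
    for src in sorted((workspace / "tmp").glob("checkpoint*.md")):
        dest = archive_dir / src.name
        shutil.move(str(src), str(dest))
        archived.append(dest)
    return archived
```

After this runs, `tmp/` holds no active checkpoint, so the next session starts fresh instead of resuming finished work.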
The Workspace Organization
Active checkpoints live in tmp/ because they're intermediate state, not deliverables:
workspace/
├── projects/
│   └── api-server/               # Code under review
│       └── ...
├── memory/
│   ├── MEMORY.md                 # Long-term memory
│   └── history.jsonl             # Conversation history
├── output/
│   └── security-review/
│       ├── threat-model.md       # Deliverable
│       ├── findings.md           # Deliverable
│       ├── remediation.md        # Deliverable
│       └── checkpoints/          # Archived checkpoints
│           ├── 2026-06-19-1430.md
│           ├── 2026-06-20-0947.md
│           └── ...
└── tmp/
    └── checkpoint-active.md      # Current resumption point
The organization follows Facio's workspace conventions. tmp/ for scratch state, output/ for deliverables, memory/ for long-term knowledge, projects/ for the actual work. Checkpoints fit naturally into this structure.
Checkpoint Patterns
Pattern 1: The Long Research Project
# Session 1 (budget: 50)
- Research phase (10 iterations): gathered 30 sources
- Analysis phase (28 iterations): synthesized findings
- Checkpoint at 75% budget (iteration 38)
- Hit budget during write-up
- Termination checkpoint written
# Session 2 (budget: 50, new session)
- Reads checkpoint
- Continues write-up (20 iterations)
- Hits budget during final review
- Termination checkpoint written
# Session 3 (budget: 50, new session)
- Reads checkpoint
- Completes final review (15 iterations)
- Delivers report
- Archives checkpoint chain
Three sessions, 115 total iterations (50 + 50 + 15), and no completed work lost between sessions. The user came back to a project that made progress in every session, even when a session ran out of budget.
Pattern 2: The Cautious Deployment
# Before deployment
- Pre-deploy checkpoint: current state, what will change
- Run deployment steps
- Post-deploy checkpoint: new state, what changed
# If deployment fails:
- Rollback to pre-deploy checkpoint
- Investigate failure
- Retry with fixes
# If deployment succeeds:
- Post-deploy checkpoint becomes the new baseline
- Continue with verification
The destructive-operation pattern. The agent doesn't deploy without a recovery point. Failures are recoverable, not catastrophic.
Pattern 3: The Multi-Day Content Project
# Day 1: Research and outline
- Read sources, draft outline
- Checkpoint at end of day
# Day 2 (next session): Draft sections
- Reads day 1 checkpoint
- Drafts each section
- Checkpoint after each section
# Day 3 (next session): Review and polish
- Reads day 2 checkpoint
- Reviews draft, polishes prose
- Delivers final version
The work continues across days. The user doesn't have to brief the agent each morning. The checkpoint carries the context.
What the Checkpoint Discipline Doesn't Do
Honest limitations:
- It doesn't capture external state. A checkpoint captures the agent's understanding of the work, not the actual state of external systems. If the agent deploys a configuration and then crashes, the deployment may have succeeded even if the agent's checkpoint says "in progress." The next session needs to verify external state.
- It doesn't make every checkpoint automatic. The runtime triggers threshold checkpoints and offers a final checkpoint on termination, but milestone and pre-destructive-operation checkpoints depend on the agent's discipline. A poorly designed agent won't take checkpoints when it should.
- It doesn't survive workspace loss. Checkpoints live in the workspace. If the workspace is corrupted, lost, or wiped, the checkpoints are gone. Production deployments should back up the workspace regularly.
- It doesn't replace the audit trail. Checkpoints are snapshots, not logs. The full audit trail is in the runtime logs; checkpoints are the resumption points within those logs.
The Compound Effect of Checkpoint Discipline
Checkpoint discipline compounds. The agent that takes checkpoints regularly:
- Resumes work that other agents abandon. Long-running projects get finished.
- Avoids the "starting over" tax. The 85 minutes of work before a crash aren't lost.
- Enables multi-day workflows. The user can leave and come back; the work continues.
- Provides operational safety. Destructive operations have rollback points.
- Improves the audit trail. The checkpoint chain is a record of the work's evolution.
The agent without checkpoint discipline has a different relationship with work. Every interruption is a potential loss. Every long task is a risk. Every session is potentially wasted.
The structural difference is the discipline. The agent that takes checkpoints finishes work. The agent that doesn't, doesn't.
Bottom Line
Sessions that lose state are sessions that waste work. The agent crashes at minute 85 of a 90-minute task; the user loses 85 minutes. The agent runs out of budget before completing the analysis; the analysis has to be redone. The network blip terminates a deployment; the deployment has to be re-investigated from scratch.
Facio's checkpoint discipline turns sessions into resumable workflows. Periodic snapshots, threshold triggers, milestone captures, pre-destructive-operation rollbacks. The next session reads the checkpoint, continues from the resumption point, and finishes the work.
The cost is small: a few extra write_file calls per session. The benefit is large: far less work lost, no starting over, no abandoned long-running projects. The discipline is the difference between an agent that finishes and one that doesn't.
Because the value of an AI agent isn't in any single session. It's in the cumulative work the agent completes across all its sessions. Checkpoint discipline is what makes that cumulative work possible.
See the checkpoint documentation for trigger configuration, file format specifications, and resumption patterns.