Product · Jun 22, 2026

Facio's Incident Response Playbook: How AI Agents Detect, Triage, and Mitigate Production Issues Autonomously

Production AI agents need an incident response playbook: a structured way to detect issues, triage severity, mitigate damage, and escalate intelligently when human judgment is required. Facio's runtime provides the building blocks: heartbeat-driven monitoring, structured error responses, log queries, HITL escalation, and checkpointed state recovery. Combined, they let agents handle routine incidents autonomously and bring humans into the loop at exactly the right moment. Here's the playbook.

Incident Response · On-Call · Operations · Reliability · Runbook


Every production system has incidents. A database connection pool maxes out. A new deployment introduces a memory leak. An upstream API starts returning 500s. A cache layer goes cold and the database takes the load. The question isn't whether your AI agent will encounter production issues — it will. The question is whether it handles them like a prototype (return the error and stop) or like a production system (detect, triage, mitigate, escalate, learn).

Facio's runtime is built for the second behavior. The agent has heartbeat-driven monitoring, structured error responses, log queries, checkpoint state recovery, and HITL escalation — the building blocks of a complete incident response capability. What ties them together into a working playbook is a structured approach to the five stages of agent incident response.

This is that playbook. It's the same pattern Facio agents use in production, condensed into a form you can adapt to your own systems.

The Five Stages of Agent Incident Response

According to Augment Code's 2026 framework and the AWS DevOps Agent team's playbooks, AI agent incident response structures the incident lifecycle into five distinct stages. Facio implements each stage through specific tools and patterns:

DETECT → TRIAGE → MITIGATE → COMMUNICATE → POSTMORTEM

Each stage has a defined purpose, a defined set of tools, and a defined escalation boundary. The agent knows when to act, when to ask, and when to hand off.

Stage 1: Detect

Detection is the runtime's job, not the agent's. Facio provides two complementary detection mechanisms:

Heartbeat-driven monitoring. A scheduled heartbeat task runs every N seconds, performing a set of health checks:

# HEARTBEAT.md entry
- [ ] Health check: API server response time (every 60s)
  - exec("curl -s -o /dev/null -w '%{time_total}' https://api.example.com/health")
  - If response time > 2s: trigger Stage 2
- [ ] Health check: Database connection pool (every 60s)
  - exec("psql -c 'SELECT count(*) FROM pg_stat_activity;'")
  - If active connections > 80% of pool: trigger Stage 2
- [ ] Health check: Disk space (every 5 minutes)
  - exec("df -h | grep '/data' | awk '{print $5}' | tr -d '%'")
  - If usage > 85%: trigger Stage 2

The heartbeat is the agent's "always watching" capability. It runs autonomously, doesn't require the user to be online, and catches the issues a human would only notice when checking dashboards.
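The threshold logic behind those heartbeat entries can be sketched in a few lines of Python. This is a minimal illustration, not Facio's implementation: `HealthSample` and `breached_checks` are hypothetical names, and the sketch assumes the raw readings have already been collected by the `exec` calls shown above.

```python
from dataclasses import dataclass

# Thresholds copied from the heartbeat entries above.
MAX_RESPONSE_SECONDS = 2.0
MAX_POOL_PERCENT = 80
MAX_DISK_PERCENT = 85


@dataclass
class HealthSample:
    """One heartbeat reading (hypothetical shape, not a Facio type)."""
    response_seconds: float
    active_connections: int
    pool_size: int
    disk_percent: int


def breached_checks(sample: HealthSample) -> list[str]:
    """Return the names of the checks that should trigger Stage 2."""
    breached = []
    if sample.response_seconds > MAX_RESPONSE_SECONDS:
        breached.append("api_response_time")
    # Integer arithmetic avoids float rounding at the exact 80% boundary.
    if sample.active_connections * 100 > MAX_POOL_PERCENT * sample.pool_size:
        breached.append("db_connection_pool")
    if sample.disk_percent > MAX_DISK_PERCENT:
        breached.append("disk_space")
    return breached
```

An empty list means the heartbeat has nothing to report; any non-empty list is the hand-off into triage.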

Structured error responses. When a tool call fails (a workflow tool, an MCP server, an external API), the runtime returns a structured error with severity classification. The error is the detection signal:

exec(command="kubectl apply -f deployment.yaml")
# Response: {"status": "error", "code": "CONNECTION_REFUSED", "severity": "critical", "retryable": true}

The agent doesn't have to inspect outputs to find errors. The runtime reports them explicitly.
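As a rough sketch of how an agent-side helper might consume that response, the snippet below turns the JSON payload into a typed detection signal. The field names come from the example above; `DetectionSignal` and `detect` are hypothetical names introduced here, and the fallback values for missing fields are assumptions.

```python
import json
from typing import NamedTuple, Optional


class DetectionSignal(NamedTuple):
    """Fields the triage stage needs (hypothetical type)."""
    code: str
    severity: str
    retryable: bool


def detect(raw_response: str) -> Optional[DetectionSignal]:
    """Return a signal for error responses, or None when the call succeeded."""
    response = json.loads(raw_response)
    if response.get("status") != "error":
        return None
    return DetectionSignal(
        code=response.get("code", "UNKNOWN"),          # assumed fallback
        severity=response.get("severity", "unknown"),  # assumed fallback
        retryable=bool(response.get("retryable", False)),
    )
```

Because the signal is explicit, the agent never has to pattern-match on free-form output to decide whether something went wrong.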

Stage 2: Triage

Once an issue is detected, the agent triages it. Triage has three sub-steps: classify, scope, and prioritize.

Classify. What kind of issue is this?

Symptom	Classification
`Connection refused`	Network/infrastructure
`ETIMEDOUT`	Network/latency
`OOMKilled`	Resource/memory
`5xx from upstream`	External service
`Permission denied`	Configuration/permissions
`Rate limit exceeded`	Capacity/quota
`Schema mismatch`	Code/data contract
`Authentication failed`	Security/credentials

The classification determines the response. Network issues suggest retry or failover. Permission issues suggest escalation. Rate limits suggest backoff. The agent doesn't treat all errors the same way.
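The table above maps naturally onto a lookup. The following is a minimal sketch that matches on substrings of the error text. The matching strategy, the `classify` name, and the "Unclassified" fallback are assumptions made for illustration, and a real classifier would likely use the structured error code rather than free text.

```python
import re

# (substring to look for, classification) pairs taken from the table above.
SYMPTOM_CLASSES = [
    ("connection refused", "Network/infrastructure"),
    ("etimedout", "Network/latency"),
    ("oomkilled", "Resource/memory"),
    ("permission denied", "Configuration/permissions"),
    ("rate limit exceeded", "Capacity/quota"),
    ("schema mismatch", "Code/data contract"),
    ("authentication failed", "Security/credentials"),
]

# Crude stand-in for "5xx from upstream": any standalone 5xx number.
HTTP_5XX = re.compile(r"\b5\d\d\b")


def classify(error_text: str) -> str:
    """Return the classification for an error message, or 'Unclassified'."""
    lowered = error_text.lower()
    for needle, label in SYMPTOM_CLASSES:
        if needle in lowered:
            return label
    if HTTP_5XX.search(lowered):
        return "External service"
    return "Unclassified"
```

Anything that falls through to "Unclassified" is a candidate for escalation rather than a guess.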

Scope. How wide is the impact?

Single component, single user. Low severity. The agent can investigate and fix.
Single component, multiple users. Medium severity. The agent investigates, mitigates, and notifies.
Multiple components or full system. High severity. The agent mitigates what it can, then escalates immediately.

The agent uses read_logs and live queries to estimate the scope before acting.

Prioritize. Given the classification and scope, what's the response priority?

P0 (Critical, full system down): Mitigate immediately, escalate simultaneously. Both happen in parallel.
P1 (Major, significant impact): Mitigate, then escalate. The mitigation buys time for the human to make decisions.
P2 (Minor, limited impact): Investigate fully, then either mitigate or escalate based on findings.
P3 (Warning, no immediate impact): Document, schedule investigation, no immediate action.

Triage turns a raw error signal into a structured response plan. The agent knows what to do before taking the first action.
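One possible way to combine scope and priority is sketched below. The inputs and the exact mapping are assumptions chosen to line up with the scope levels and P0 to P3 definitions above; a real deployment would tune them.

```python
def prioritize(components_affected: int, users_affected: int, full_outage: bool) -> str:
    """Map scope findings to a priority label (illustrative mapping only)."""
    if full_outage:
        return "P0"  # full system down
    if components_affected > 1:
        return "P1"  # multiple components, significant impact
    if users_affected > 0:
        return "P2"  # single component, limited impact
    return "P3"      # warning only, no immediate impact
```

The point of writing it down is that the agent reaches the same priority for the same findings every time, regardless of how alarming the raw error text looks.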

Stage 3: Mitigate

Mitigation is the action phase. The agent has a playbook of mitigation actions, ordered by safety:

Tier 1: Read-only diagnostic actions. The agent gathers more information without changing anything.

# Check current state
exec(command="kubectl get pods -n production")
# Check recent changes
exec(command="git log --since='2 hours ago' --oneline")
# Check error rate
read_logs(level="ERROR", since="1h", grep="database")

Read-only actions never make the incident worse. The agent uses them to confirm hypotheses before taking corrective action.

Tier 2: Reversible mitigations. The agent takes action that can be undone.

# Roll back a recent deployment
exec(command="kubectl rollout undo deployment/api-server")
# Increase a resource limit
exec(command="kubectl set resources deployment/api-server --limits=memory=2Gi")
# Clear a cache (entries are rebuilt on demand, not restored; expect extra load on the backing store)
exec(command="redis-cli FLUSHDB")

Reversible mitigations are bounded risk. The agent can apply them without escalation if they're clearly indicated by the triage findings.

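Tier 3: Checkpoint + irreversible action. Before any irreversible action, the agent writes a checkpoint. If the action makes things worse, the agent can roll back to the pre-action state. Because the action cannot be undone, it also falls under the escalation boundary described later: irreversible actions need human approval before they run.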

# Pre-action checkpoint
write_file(path="tmp/checkpoint-pre-mitigation.md", content=...)

# Irreversible action
exec(command="psql -c 'DROP TABLE temp_session_data'")

# Verify
read_logs(level="ERROR", since="1m")

# If worse: roll back using the checkpoint

The checkpoint is the safety net for high-stakes actions. The agent doesn't take irreversible actions without one.
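The ordering guarantee (checkpoint first, action second) can be sketched as a small wrapper. `run_with_checkpoint` is a hypothetical helper, and the JSON checkpoint format is an assumption; the example above writes Markdown through `write_file`. Note that a checkpoint of this kind records what the agent knew and intended. Restoring dropped data would still need a backup, which this sketch does not cover.

```python
import json
from pathlib import Path
from typing import Callable


def run_with_checkpoint(checkpoint_path: Path, state: dict, action: Callable[[], None]) -> None:
    """Persist the pre-action state to disk, then run the irreversible action.

    If writing the checkpoint raises, the action never runs.
    """
    checkpoint_path.parent.mkdir(parents=True, exist_ok=True)
    checkpoint_path.write_text(json.dumps(state, indent=2))
    action()
```

Putting both steps behind one call makes it hard to reach the irreversible step without the checkpoint already on disk.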

Tier 4: Escalate-and-pause. When the issue is beyond the agent's authority or capability, the agent escalates immediately and pauses autonomous action.

ask_approval(
    title="Production database migration failed",
    description="""Migration to add customer_segment column failed. The error indicates partial state — column added but indexes missing.
    
    Options:
    1. Roll back migration (safe, requires ~5 min downtime)
    2. Manually add missing indexes (faster, requires DB admin)
    3. Investigate further before deciding (slower, no immediate risk)
    
    I recommend option 2 — fastest path to consistent state.""",
    options=[
        {"id": "option_2", "label": "Add missing indexes (recommended)"},
        {"id": "option_1", "label": "Roll back migration"},
        {"id": "option_3", "label": "Investigate further first"}
    ]
)

The escalation gives the human a structured decision. The agent pauses autonomous action until the human responds.

Stage 4: Communicate

The agent doesn't silently fix problems in the background. The user and other stakeholders need to know what's happening. Communication has three levels:

Status updates. Brief, non-actionable messages that the human can ignore if they want. Sent via message to the appropriate channel.

message(
    content="""[10:42 UTC] API server response time exceeded 2s threshold. Investigating.
[10:43 UTC] Identified: recent deployment introduced N+1 query in /users endpoint.
[10:45 UTC] Mitigation: rolled back deployment. Response time back to 200ms.
[10:47 UTC] Monitoring for recurrence. Will post postmortem within 24h.""",
    channel="placet"
)

The status update pattern is: short, timestamped, factual. The human can scan it and know the agent is working, what it found, and what it did.
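A tiny formatter is enough to keep that pattern consistent. This is a sketch; `status_line` is a hypothetical helper, not part of the `message` tool.

```python
from datetime import datetime, timezone


def status_line(when: datetime, text: str) -> str:
    """Format one status entry as '[HH:MM UTC] text'.

    `when` should be timezone-aware; it is converted to UTC before formatting.
    """
    stamp = when.astimezone(timezone.utc).strftime("%H:%M")
    return f"[{stamp} UTC] {text}"
```

Joining several of these lines with newlines produces the kind of update shown above, ready to pass as `content`.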

Stakeholder notifications. When an incident affects customers, partners, or other teams, the agent notifies the appropriate channels. The notifications are templated and don't reveal internal jargon.

message(
    content="""Service Incident Report
    
    Time: 10:42-10:47 UTC
    Impact: API latency elevated to >2s for ~5 minutes
    Affected: ~5% of API requests in the EU region
    Status: Resolved
    
    The issue was caused by a recent deployment and has been rolled back. No data loss occurred. We're investigating the root cause and will publish a postmortem within 24 hours.""",
    channel="telegram"
)

HITL escalations. When the agent needs a human decision, the escalation goes through ask_approval, ask_form, or ask_selection — Placet's structured cards. The human sees the title, description, and options in their normal messaging interface. No separate "agent dashboard" required.

Stage 5: Postmortem

After the incident is resolved, the agent writes a postmortem. The postmortem is mandatory for P0 and P1 incidents; recommended for P2.

The postmortem captures:

# Incident Postmortem: 2026-06-22 10:42-10:47 UTC

## Summary
API server response time exceeded 2s threshold for 5 minutes. Caused by recent deployment introducing N+1 query. Mitigated by rollback. No data loss.

## Timeline
- 10:42 UTC: Heartbeat detected response time 2.3s (above 2s threshold)
- 10:42:30 UTC: Agent initiated triage — read logs, checked recent deployments
- 10:43:15 UTC: Identified commit abc123 added N+1 query in /users endpoint
- 10:44:00 UTC: Took pre-mitigation checkpoint
- 10:45:30 UTC: Rolled back deployment to previous version
- 10:46:00 UTC: Verified response time back to 200ms
- 10:47:00 UTC: Marked incident resolved, posted status update

## Root Cause
Commit abc123 (PR #342) added a new endpoint /users that loaded the full user object for every related entity. With 50+ related entities per user, the query became 50x slower for users with rich profiles.

## What Went Well
- Heartbeat caught the issue within 30 seconds
- Triage correctly identified the cause from commit history
- Rollback was fast and effective
- Status updates kept stakeholders informed

## What Went Wrong
- The N+1 query wasn't caught in code review
- No automated performance test caught it before deploy
- Heartbeat threshold (2s) is loose; should be 1s for /users

## Action Items
- [ ] Add performance test to CI for /users endpoint
- [ ] Tighten heartbeat threshold to 1s for user-facing endpoints
- [ ] Add N+1 query detector to pre-commit hook
- [ ] Review all recent PRs for similar patterns

The postmortem is committed to output/incidents/2026-06-22-api-latency.md and indexed for future recall. The action items become heartbeat tasks or inline learning entries, depending on their nature.

Severity Matrix

Not all incidents are equal. Facio's playbook defines a severity matrix that maps symptoms to response:

Severity	Example symptoms	Response time	Action
P0 (Critical)	Full system down, data loss, security breach	Immediate	Mitigate + escalate simultaneously
P1 (Major)	Significant degradation, multiple components affected	< 5 min	Mitigate, then escalate
P2 (Minor)	Limited degradation, single component	< 15 min	Investigate, mitigate if clear, escalate if uncertain
P3 (Warning)	Approaching threshold, no immediate impact	< 1 hour	Document, schedule investigation

The matrix is the agent's decision support. When the triage findings map to P0, the agent doesn't wait — it acts and escalates in parallel. When the findings map to P3, the agent logs and moves on.
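Encoded as data, the matrix might look like the sketch below. The dictionary layout, the action labels, and `response_overdue` are illustrative assumptions; "Immediate" is represented as a zero-length deadline.

```python
from datetime import timedelta

# Response deadlines and actions transcribed from the matrix above.
SEVERITY_MATRIX = {
    "P0": {"respond_within": timedelta(0), "action": "mitigate_and_escalate_in_parallel"},
    "P1": {"respond_within": timedelta(minutes=5), "action": "mitigate_then_escalate"},
    "P2": {"respond_within": timedelta(minutes=15), "action": "investigate_then_decide"},
    "P3": {"respond_within": timedelta(hours=1), "action": "document_and_schedule"},
}


def response_overdue(priority: str, elapsed: timedelta) -> bool:
    """True when more time has passed than the matrix allows for this priority."""
    return elapsed > SEVERITY_MATRIX[priority]["respond_within"]
```

Keeping the matrix in one structure means a change to a deadline or an action is a one-line edit rather than a hunt through prompt text.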

The Escalation Boundary

The agent handles most incidents autonomously. The escalation boundary — when the agent asks for human help — is defined by three rules:

Rule 1: Action reversibility. Reversible actions don't need approval. The agent can roll back a deployment, increase a resource limit, clear a cache, restart a pod. Irreversible actions (drop tables, delete records, change billing) always need approval.

Rule 2: Action scope. Actions affecting a single component don't need approval. Actions affecting multiple components or shared infrastructure need approval.

Rule 3: Confidence level. When the agent is confident in the diagnosis and the mitigation is well-understood, it acts. When the diagnosis is uncertain or the mitigation could make things worse, it escalates.

The three rules combine into a clear escalation policy. The agent doesn't ask for permission for routine, reversible, well-understood actions. It does ask for permission for high-stakes, irreversible, or uncertain actions.
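Read literally, the three rules reduce to a single predicate: the agent acts alone only when all three conditions hold. A minimal sketch, with `needs_approval` as a hypothetical name:

```python
def needs_approval(reversible: bool, single_component: bool, confident: bool) -> bool:
    """Escalate unless the action is reversible, narrowly scoped, and well understood."""
    return not (reversible and single_component and confident)
```

For example, rolling back one deployment with a confident diagnosis returns `False`, while dropping a table returns `True` no matter how confident the agent is.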

Detection Signals in Practice

The TianPan.co playbook highlights the importance of well-calibrated detection thresholds. The agent's heartbeat tasks need thresholds that catch real issues without flooding the human with false positives.

A common calibration pattern:

Week 1: Tight thresholds. Catch everything, accept noise. The human reviews every alert and tags which were real.
Week 2: Adjust. Loosen thresholds for categories that produced only false positives. Tighten for categories that produced only real issues.
Week 3: Lock in. The thresholds are now calibrated. The agent applies them confidently.

The Reflection process periodically reviews the threshold calibration. If false-positive rates climb, the thresholds are revisited.
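The week 2 adjustment can be sketched as a pass over the alerts the human tagged in week 1. The input shape, a list of `(category, was_real)` pairs, and the `suggest_adjustments` name are assumptions made for illustration.

```python
from collections import defaultdict


def suggest_adjustments(tagged_alerts):
    """Suggest 'loosen', 'tighten', or 'keep' per alert category.

    tagged_alerts: iterable of (category, was_real) pairs from the human review.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [real, false_positive]
    for category, was_real in tagged_alerts:
        counts[category][0 if was_real else 1] += 1

    suggestions = {}
    for category, (real, false_positive) in counts.items():
        if real == 0:
            suggestions[category] = "loosen"   # only false positives
        elif false_positive == 0:
            suggestions[category] = "tighten"  # only real issues
        else:
            suggestions[category] = "keep"     # mixed results: leave as-is for now
    return suggestions
```

Categories with mixed results are deliberately left alone here; they are the ones worth a closer look by a human before week 3.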

The Runbook-as-Code Pattern

For known incident types, the runbook is encoded as a heartbeat task. The agent has explicit instructions for what to do when a specific symptom appears:

# HEARTBEAT.md
- [ ] **Runbook: Database connection pool exhaustion**
    - Trigger: psql connection count > 80% of pool for > 2 minutes
    - Actions:
      1. exec("psql -c \"SELECT pid, state, query FROM pg_stat_activity WHERE state = 'active' ORDER BY query_start LIMIT 20\"") → identify long-running queries
      2. If a query has been running > 5 min: exec("psql -c 'SELECT pg_cancel_backend(<pid>)'") for each such pid
      3. exec("psql -c 'SELECT pg_terminate_backend(<pid>)'") only as a last resort
      4. message("Database pool exhausted, cancelled or terminated N queries, monitoring")
      5. If pool still > 90% after 5 min: ask_approval for emergency pool increase

The runbook-as-code pattern is the agent's institutional knowledge for incident response. New runbooks are added as the agent encounters new incident types. Reflection consolidates the runbooks over time.
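The runbook's trigger ("above 80% of pool for more than 2 minutes") is a sustained-breach condition, not a single reading. One way to evaluate it over heartbeat samples is sketched below; `sustained_breach` and the `(timestamp_seconds, value)` sample shape are hypothetical.

```python
def sustained_breach(samples, threshold, min_seconds):
    """True when the value has stayed above threshold for more than min_seconds.

    samples: list of (timestamp_seconds, value) pairs, oldest first.
    The breach must still be ongoing at the latest sample.
    """
    breach_started = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = timestamp
        else:
            breach_started = None  # recovered, so reset the clock
    if breach_started is None:
        return False
    return samples[-1][0] - breach_started > min_seconds
```

With 60-second heartbeats, a call such as `sustained_breach(samples, 0.80, 120)` fires only after the pool has been over 80% for several consecutive readings, which filters out momentary spikes.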

The Compound Effect

Incident response is one of the highest-leverage capabilities for an AI agent. Every incident handled autonomously is one fewer page in the middle of the night. Every postmortem captured is institutional knowledge that improves future responses. Every runbook added is preparation for the next occurrence of the same problem.

The agent that runs the incident response playbook:

Catches issues faster than humans. Heartbeat monitoring runs 24/7, doesn't sleep, doesn't miss a check.
Triages more accurately. Structured error responses and severity matrices reduce subjective judgment under pressure.
Mitigates more safely. Checkpoints, reversibility awareness, and escalation boundaries prevent the agent from making things worse.
Communicates more reliably. Templated status updates and structured escalations keep stakeholders informed.
Learns more effectively. Postmortems are captured, runbooks are encoded, lessons are applied.

The agent without an incident response playbook is a research tool that breaks under pressure. The agent with one is an operational teammate that handles production issues autonomously, escalates intelligently, and improves over time.

Bottom Line

Production AI agents encounter incidents. The question is whether they handle them well. Facio's incident response playbook is the structured approach: detect through heartbeat and structured errors, triage through classification and severity matrices, mitigate through tiered actions with checkpoints, communicate through templated updates and HITL escalations, and postmortem through captured timelines and action items.

The playbook turns the runtime's building blocks — monitoring, errors, logs, checkpoints, HITL — into a working incident response capability. The agent that runs the playbook handles routine incidents autonomously and brings humans into the loop at exactly the right moment. The agent that doesn't, doesn't.

Production agents are measured by what happens when things go wrong. The playbook is what lets them handle those moments well.

See the incident response documentation for runbook templates, severity matrix configuration, and postmortem frameworks.
