Facio's Self-Healing Agents: How Error Detection, Diagnosis, and Recovery Run as a Continuous Loop
Every AI agent in production will fail. A tool returns an unexpected response. An MCP server goes down. A timeout is exceeded. A rate limit is hit. An external API is unreachable. The failures are inevitable; what separates a production agent from a prototype is what happens after.
A prototype agent that hits an error returns the error to the user and stops. A production agent detects the failure, classifies it, attempts a fix, and either recovers or escalates with a clear diagnosis. The user doesn't see "something went wrong" — they see "the agent handled it" or "the agent needs your input on a specific decision."
Facio's self-healing architecture makes the second behavior the default. Through a continuous detection-diagnosis-recovery loop, the agent turns failures into recovery actions — autonomously for routine cases, with intelligent escalation for the rest. Here's how the loop works and what it enables.
The Failure Recovery Loop
Every tool call in Facio's runtime is part of an error recovery loop, whether the agent realizes it or not:
DETECT → CLASSIFY → ATTEMPT FIX → ESCALATE IF NEEDED → LEARN → REPEAT
The loop is the structural reason Facio agents behave like production systems rather than prototypes. Each stage has a defined purpose and a defined interface to the next.
Stage 1: Detection
Facio's runtime detects failures at the tool layer. Every tool call returns a structured response: success, error, timeout, blocked, or partial. The agent doesn't have to inspect outputs for failure markers; the runtime reports status explicitly.
The detection happens before the agent sees the result. By the time the response reaches the agent's context, the failure is already flagged as a failure. It is not buried in a 200 OK response that happens to contain an error message, and it is not hidden in a stack trace the agent has to parse.
exec(command="kubectl apply -f deployment.yaml")
# Response: {"status": "error", "code": "CONNECTION_REFUSED", "message": "API server unreachable", "retryable": true}
The agent reads the structured response. The detection is the runtime's job, not the agent's.
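As a rough illustration of what consuming such a response could look like, the sketch below models the five statuses as an enum and parses the example response into a typed object. The names `Status`, `ToolResult`, and `parse_result` are hypothetical and not part of Facio's API; only the field names and status values come from the text above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class Status(str, Enum):
    """The five statuses named above."""
    SUCCESS = "success"
    ERROR = "error"
    TIMEOUT = "timeout"
    BLOCKED = "blocked"
    PARTIAL = "partial"


@dataclass(frozen=True)
class ToolResult:
    status: Status
    code: Optional[str] = None
    message: Optional[str] = None
    retryable: bool = False

    @property
    def failed(self) -> bool:
        # Assumption: anything other than "success" needs handling.
        # A real runtime might treat "partial" differently.
        return self.status is not Status.SUCCESS


def parse_result(raw: dict[str, Any]) -> ToolResult:
    """Turn a raw response dict into a typed result.

    An unknown status raises ValueError instead of passing through silently.
    """
    return ToolResult(
        status=Status(raw["status"]),
        code=raw.get("code"),
        message=raw.get("message"),
        retryable=bool(raw.get("retryable", False)),
    )


result = parse_result({
    "status": "error",
    "code": "CONNECTION_REFUSED",
    "message": "API server unreachable",
    "retryable": True,
})
```

With a typed status, the failure check is a single attribute read (`result.failed`) rather than a search through output text.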
Stage 2: Classification
Not all errors are the same. A network timeout can usually be retried. An authentication failure cannot. A rate limit is temporary. A missing file is permanent. The agent's recovery strategy depends on what kind of error it is.
Facio's structured error responses include a classification the agent can reason about:
- retryable: true means the operation might succeed if retried: rate limits, transient network errors, temporary unavailability.
- retryable: false means the operation will not succeed by retrying: bad input, missing permissions, invalid configuration.
- recoverable: true means a fallback action could produce a useful result: a different MCP server, a different model, a different approach.
- recoverable: false means the operation has failed and no automated fallback exists. Escalation is required.
- severity: warning | error | critical gives the urgency of the situation. Warnings can wait; errors need attention; criticals need human eyes.
The classification is the agent's input to the recovery decision. A retryable error suggests a retry. A non-retryable error suggests a different approach. A critical error suggests escalation.
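One way to picture that decision is as a small function over the three fields. This is a minimal sketch, not Facio's actual policy: `ErrorInfo`, `choose_action`, and the order in which the fields are checked are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorInfo:
    retryable: bool = False
    recoverable: bool = False
    severity: str = "error"  # "warning" | "error" | "critical"


def choose_action(err: ErrorInfo, attempts_left: int) -> str:
    """Return "retry", "fallback", or "escalate" for a classified error."""
    if err.severity == "critical":
        return "escalate"   # criticals go to a human regardless of other flags
    if err.retryable and attempts_left > 0:
        return "retry"      # transient: the same call may succeed
    if err.recoverable:
        return "fallback"   # try a different tool, model, or server
    return "escalate"       # nothing automated left to try
```

Checking severity first is a deliberate choice in this sketch: a critical error is escalated even if it is nominally retryable.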
Stage 3: Attempt Fix
For retryable and recoverable errors, the agent attempts a fix before escalating. The fix can take several forms:
Retry with backoff. For transient errors, the agent waits briefly and retries. A rate-limit response tells the agent to back off before trying again; after a network timeout, an immediate retry may succeed if the underlying issue has already cleared.
Fallback to alternative. For recoverable errors, the agent tries a different tool, model, or MCP server. The weather API is down? Try the backup weather MCP. The flagship model is rate-limited? Switch to a cheaper model for this iteration.
Adjust parameters. For parameter-related errors, the agent adjusts the inputs. A file too large? Read in chunks. A query too complex? Decompose into smaller queries. A command too long? Split it.
Self-repair. For configuration errors, the agent can fix the configuration. A wrong path? Find the correct path with glob or grep. A missing dependency? Install it via exec. A wrong model name? Look up the correct name.
The fix attempt is bounded. The iteration budget prevents infinite retry loops. If the fix doesn't work within a few attempts, the agent moves to escalation rather than burning the budget.
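The retry and fallback strategies, together with the bound on attempts, can be sketched in a few lines. The function below is an illustrative assumption, not Facio's runtime code: `call_with_recovery`, its parameters, and the exponential delay schedule are hypothetical, and each candidate is simply a callable returning a response dict with the `status` and `retryable` fields described earlier.

```python
import time
from typing import Callable, Optional, Sequence


def call_with_recovery(
    candidates: Sequence[Callable[[], dict]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Try each candidate in order, retrying retryable errors with backoff.

    Returns the first successful response, or None once every candidate is
    exhausted, which is the caller's signal to escalate.
    """
    for call in candidates:
        for attempt in range(max_attempts):
            response = call()
            if response.get("status") == "success":
                return response
            if not response.get("retryable", False):
                break  # retrying will not help; move to the next candidate
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return None
```

Passing the primary call first and a backup second gives "retry, then fall back, then escalate" in one place. The `sleep` parameter exists so the delay can be replaced in tests.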
Stage 4: Escalate
When the agent can't fix the issue autonomously, it escalates — but not with a generic "something went wrong." It escalates with a clear diagnosis:
ask_approval(
title="Database migration failed: need human input",
description="""The migration to add the `customer_segment` column failed with:
Error: ALTER TABLE customers ADD COLUMN customer_segment VARCHAR(50)
Detail: column 'customer_segment' already exists
Possible next steps:
1. Drop the existing column and re-run the migration (data loss risk)
2. Use ALTER TABLE ... ADD COLUMN IF NOT EXISTS (idempotent fix)
3. Investigate why the column already exists before proceeding
I recommend option 2 — it accomplishes the goal without data loss.
Approve to proceed with option 2, or reply with alternative instructions.""",
options=[
{"id": "option_2", "label": "Use IF NOT EXISTS (recommended)"},
{"id": "option_1", "label": "Drop and re-run"},
{"id": "abort", "label": "Abort migration"}
]
)
The escalation gives the human:
- What happened — the specific error and context
- Why it happened — the agent's diagnosis
- What the agent recommends — the suggested next step
- Alternatives — the human can pick a different path
This is the opposite of "the agent is stuck, please help." It's "the agent has done its analysis, here are the options, please decide."
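To make the structure of such an escalation concrete, here is a hedged sketch of a helper that assembles the arguments from a diagnosis. `Option`, `Diagnosis`, and `build_escalation` are hypothetical names introduced for illustration; the output keys (`title`, `description`, `options` with `id` and `label`) mirror the `ask_approval` call shown above, but the helper itself is not part of Facio.

```python
from dataclasses import dataclass, field


@dataclass
class Option:
    id: str
    label: str
    recommended: bool = False


@dataclass
class Diagnosis:
    what: str  # the specific error and context
    why: str   # the agent's diagnosis
    options: list[Option] = field(default_factory=list)


def build_escalation(title: str, diagnosis: Diagnosis) -> dict:
    """Assemble keyword arguments shaped like the ask_approval call above."""
    if not diagnosis.options:
        raise ValueError("an escalation needs at least one option to choose from")
    # Stable sort: recommended options first, original order otherwise.
    ordered = sorted(diagnosis.options, key=lambda o: not o.recommended)
    lines = [diagnosis.what, f"Diagnosis: {diagnosis.why}", "Possible next steps:"]
    lines += [f"{i}. {o.label}" for i, o in enumerate(ordered, start=1)]
    return {
        "title": title,
        "description": "\n".join(lines),
        "options": [
            {"id": o.id, "label": o.label + (" (recommended)" if o.recommended else "")}
            for o in ordered
        ],
    }
```

Refusing to build an escalation with no options enforces the point of this stage: the human is asked to decide between paths, not to debug from scratch.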
Stage 5: Learn
Failures are learning opportunities. Facio's inline learning captures the failure and the recovery for future sessions:
edit_file(
path="memory/MEMORY.md",
old_text="## Production Deployment\n",
new_text="## Production Deployment\n- Customer table migrations: use `ADD COLUMN IF NOT EXISTS` for idempotency\n- Database connection: pool size 20 max, 5 idle (avoid connection storm)\n"
)
The lesson is captured in the memory file, so the next session starts with the knowledge that migrations should use IF NOT EXISTS. With that note in its context, the agent has what it needs to avoid repeating the mistake, and the learning compounds.
For more complex lessons, the agent updates MEMORY.md with a structured "Lessons Learned" section, or adds a note to the relevant skill's documentation. The memory architecture is what keeps a failure from being rediscovered from scratch in every session.
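A detail worth getting right when writing lessons to a file is idempotency: recording the same lesson twice should not grow the file. The sketch below shows one way to do that with plain file operations. It is an assumption-laden illustration, not how Facio's `edit_file` tool works; `record_lesson` and its section-matching rule are hypothetical.

```python
from pathlib import Path


def record_lesson(memory_file: Path, section: str, lesson: str) -> bool:
    """Add `lesson` as a bullet under `## section`.

    Returns False without writing if the exact bullet is already present.
    """
    text = memory_file.read_text(encoding="utf-8") if memory_file.exists() else ""
    bullet = f"- {lesson}"
    lines = text.splitlines()
    if bullet in lines:
        return False  # already recorded; keep the file stable across sessions
    heading = f"## {section}"
    if heading in lines:
        # Insert directly under the existing heading.
        lines.insert(lines.index(heading) + 1, bullet)
    else:
        # Start a new section at the end, separated by a blank line.
        if lines and lines[-1] != "":
            lines.append("")
        lines += [heading, bullet]
    memory_file.parent.mkdir(parents=True, exist_ok=True)
    memory_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return True
```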
What Self-Healing Looks Like in Practice
Scenario 1: The Transient Network Error
# Agent calls an external API
web_fetch(url="https://api.example.com/data")
# Response: {"status": "error", "code": "ETIMEDOUT", "retryable": true, "severity": "warning"}
# Agent's recovery:
# 1. Wait 5 seconds
# 2. Retry the request
web_fetch(url="https://api.example.com/data")
# Response: {"status": "success", ...}
Total impact: 5-second delay, no user-visible error. The agent self-heals without intervention.
Scenario 2: The MCP Server Is Down
# Agent tries to use a weather MCP server
# The tool call fails because the server is unresponsive
# Response: {"status": "error", "code": "MCP_SERVER_UNREACHABLE", "recoverable": true}
# Agent's recovery:
# 1. Check the audit trail for the server's recent behavior
read_logs(level="ERROR", since="1h", grep="weather-mcp")
# 2. Identify that the server has been failing for the last 30 minutes
# 3. Disable the failing server
manage_mcp(action="disable", name="weather-mcp")
# 4. Switch to a backup weather source (if configured)
# 5. Continue the workflow with the backup
# 6. Log the incident
edit_file(path="memory/MEMORY.md", ...) # Note: weather-mcp down as of 2026-06-17
Total impact: 30-second recovery time, no user-visible error, agent learns from the incident.
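Step 2 of this scenario, deciding from the logs that a server is persistently failing rather than having a single bad call, can be sketched as a simple threshold over recent error timestamps. This assumes the timestamps have already been extracted from the log output; the format returned by `read_logs` is not specified here, and `should_disable`, the 30-minute window, and the threshold of three failures are illustrative assumptions.

```python
from datetime import datetime, timedelta


def should_disable(
    failure_times: list[datetime],
    now: datetime,
    window: timedelta = timedelta(minutes=30),
    min_failures: int = 3,
) -> bool:
    """True when at least `min_failures` errors fall inside the trailing window."""
    recent = [t for t in failure_times if timedelta(0) <= now - t <= window]
    return len(recent) >= min_failures
```

A threshold like this keeps one transient error from disabling a server that is otherwise healthy.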
Scenario 3: The Permanent Failure
# Agent tries to create a deployment that already exists
exec(command="kubectl create -f deployment.yaml")
# Response: {"status": "error", "code": "ALREADY_EXISTS", "retryable": false, "recoverable": true}
# Agent's recovery:
# 1. Check the current state of the deployment
exec(command="kubectl get deployment api-server -o yaml")
# 2. Compare to the desired state
# 3. Determine that the deployment is already in the desired state
# 4. Continue the workflow — no action needed
Total impact: 10-second investigation, no user-visible error, agent correctly identifies that the "failure" is actually success.
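The comparison in steps 2 and 3 can be sketched as a recursive subset check: every field the desired state specifies must appear with the same value in the current state, while extra fields the server adds are ignored. This is a simplified illustration that assumes both states have already been parsed into dicts and lists; `matches_desired` is a hypothetical helper, and a real comparison would also need to handle server-side defaulting and normalization.

```python
from typing import Any


def matches_desired(desired: Any, current: Any) -> bool:
    """True if every field in `desired` appears with the same value in `current`.

    Extra keys in `current` (server-populated defaults, status) are ignored.
    Lists must have the same length and match element by element.
    """
    if isinstance(desired, dict):
        return isinstance(current, dict) and all(
            key in current and matches_desired(value, current[key])
            for key, value in desired.items()
        )
    if isinstance(desired, list):
        return (
            isinstance(current, list)
            and len(desired) == len(current)
            and all(matches_desired(d, c) for d, c in zip(desired, current))
        )
    return desired == current
```

If the check passes, the agent can treat the error as a no-op and continue. If it fails, the difference between the two states becomes the diagnosis for the next step.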
Scenario 4: The Escalation-Required Failure
# Agent tries to delete a production database
exec(command="dropdb production_customers")
# Response: {"status": "error", "code": "PERMISSION_DENIED", "retryable": false, "recoverable": false, "severity": "critical"}
# Agent's recovery:
# 1. Recognize that the operation requires elevated permissions
# 2. Escalate with a clear diagnosis:
ask_approval(
title="Permission required: delete production database",
description="""I attempted to drop the production_customers database as part of the data cleanup, but the operation was denied due to insufficient permissions.
I need explicit authorization to drop the production_customers database.
This action is irreversible: all customer data will be deleted.
Are you sure you want to proceed?""",
options=[
{"id": "confirm", "label": "Yes, proceed with deletion"},
{"id": "abort", "label": "No, abort the cleanup"}
]
)
Total impact: agent pauses safely, human makes the decision, audit trail records the escalation.
The Architectural Properties That Enable Self-Healing
Self-healing isn't a feature Facio added on top. It emerges from architectural properties the runtime provides:
Structured error responses. Tools don't return free-form text on failure. They return typed errors with classification (retryable, recoverable, severity). The agent can reason about errors instead of parsing them.
Bounded retries. The iteration budget prevents infinite retry loops. The agent can attempt recovery within a bounded number of iterations, but it can't burn the entire budget retrying the same failed call.
Multi-tool access. The agent has access to read_logs, manage_mcp, switch_model, and other tools it can use for self-repair. The recovery isn't limited to "retry the same call" — it can change the approach.
HITL escalation. When self-healing isn't possible, the agent has a structured way to ask for human input. The escalation isn't "the agent is stuck" — it's "the agent has done its analysis and needs a specific decision."
Memory persistence. Lessons learned from failures are captured in MEMORY.md. The next session starts with the knowledge. Self-healing compounds over time.
Audit trail. Every failure, every recovery attempt, every escalation is logged. Operators can see patterns, identify systemic issues, and improve the agent's behavior over time.
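As an example of the kind of pattern-finding an audit trail makes possible, the sketch below counts failures per tool and error code and surfaces the ones that recur. The record shape (dicts with `tool`, `code`, and `status` keys) and the function `recurring_failures` are assumptions for illustration; the actual audit log format may differ.

```python
from collections import Counter


def recurring_failures(
    records: list[dict], threshold: int = 3
) -> list[tuple[tuple[str, str], int]]:
    """Count failures per (tool, code) pair.

    Returns the pairs seen at least `threshold` times, most frequent first.
    """
    counts = Counter(
        (r["tool"], r["code"]) for r in records if r.get("status") != "success"
    )
    return [(key, n) for key, n in counts.most_common() if n >= threshold]
```

A pair that keeps appearing in this list points to a systemic issue, such as a flaky dependency or a misconfigured tool, rather than something runtime recovery should keep absorbing.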
What Self-Healing Doesn't Solve
Honest limitations:
- It can't fix architectural problems. If the agent's design is fundamentally wrong — using the wrong tool for the job, misunderstanding the user's intent — self-healing doesn't help. Those problems need human design review.
- It can't recover from model failures. If the underlying LLM produces consistently bad outputs, the agent's recovery is limited. The fix is model selection, not self-healing.
- It doesn't make every error recoverable. Some errors are catastrophic. The agent's self-healing can't bring back a deleted database. The right response is escalation, not pretending recovery is possible.
- It doesn't replace testing. Self-healing is for the unexpected. Known failure modes should be tested, not left to runtime recovery.
Bottom Line
Production AI agents will fail. The question is whether they fail like prototypes — returning errors and stopping — or like production systems — detecting, classifying, recovering, and learning.
Facio's self-healing architecture makes production-style failure the default. The runtime detects failures. The error structure classifies them. The agent attempts fixes. The iteration budget bounds the recovery attempts. The HITL tools enable clean escalation. The memory system captures the lessons. The audit trail records everything.
The result: an agent that fails like a senior engineer — recovers what it can, escalates what it can't, and gets smarter with every incident.
The goal is a production agent that fails well. The failures are guaranteed. The recovery is the design.
See the self-healing documentation for error response schemas, recovery pattern libraries, and escalation templates.