Product · Jun 17, 2026

Facio's Self-Healing Agents: How Error Detection, Diagnosis, and Recovery Run as a Continuous Loop

AI agents that fail in production are a fact of life. The question isn't whether your agent will encounter an error — it's what happens after. Facio's self-healing architecture turns failures into recovery actions through a continuous detection-diagnosis-recovery loop. The agent detects, classifies, and resolves issues without human intervention for routine cases, and escalates intelligently when it can't. Here's how the loop works.

Every AI agent in production will fail. A tool returns an unexpected response. An MCP server goes down. A timeout is exceeded. A rate limit is hit. An external API is unreachable. The failures are inevitable; what separates a production agent from a prototype is what happens after.

A prototype agent that hits an error returns the error to the user and stops. A production agent detects the failure, classifies it, attempts a fix, and either recovers or escalates with a clear diagnosis. The user doesn't see "something went wrong" — they see "the agent handled it" or "the agent needs your input on a specific decision."

Facio's self-healing architecture makes the second behavior the default. Through a continuous detection-diagnosis-recovery loop, the agent turns failures into recovery actions — autonomously for routine cases, with intelligent escalation for the rest. Here's how the loop works and what it enables.

The Failure Recovery Loop

Every tool call in Facio's runtime is part of an error recovery loop, whether the agent realizes it or not:

DETECT → CLASSIFY → ATTEMPT FIX → ESCALATE IF NEEDED → LEARN → REPEAT

The loop is the structural reason Facio agents behave like production systems rather than prototypes. Each stage has a defined purpose and a defined interface to the next.

Stage 1: Detection

Facio's runtime detects failures at the tool layer. Every tool call returns a structured response: success, error, timeout, blocked, or partial. The agent doesn't have to inspect outputs for failure markers; the runtime reports status explicitly.

The detection happens before the agent sees the result. By the time the response reaches the agent's context, the failure is already classified as a failure — not buried in a 200 OK response that happens to contain an error message, not hidden in a stack trace the agent has to parse.

exec(command="kubectl apply -f deployment.yaml")
# Response: {"status": "error", "code": "CONNECTION_REFUSED", "message": "API server unreachable", "retryable": true}

The agent reads the structured response. The detection is the runtime's job, not the agent's.
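To make the idea concrete, here is a minimal sketch of how a consumer might turn such a response into a typed value. It assumes the five status values listed above and the field names from the example response; `Status`, `ToolResult`, and `parse_tool_result` are hypothetical names chosen for illustration, not part of Facio's API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Mapping, Optional


class Status(Enum):
    SUCCESS = "success"
    ERROR = "error"
    TIMEOUT = "timeout"
    BLOCKED = "blocked"
    PARTIAL = "partial"


@dataclass(frozen=True)
class ToolResult:
    status: Status
    code: Optional[str] = None
    message: str = ""
    retryable: bool = False

    @property
    def ok(self) -> bool:
        # Only a clean success skips the recovery loop.
        return self.status is Status.SUCCESS


def parse_tool_result(raw: Mapping[str, Any]) -> ToolResult:
    """Convert a raw response dict into a typed result.

    An unknown status raises ValueError instead of being guessed at.
    """
    return ToolResult(
        status=Status(raw["status"]),
        code=raw.get("code"),
        message=raw.get("message", ""),
        retryable=bool(raw.get("retryable", False)),
    )


result = parse_tool_result({
    "status": "error",
    "code": "CONNECTION_REFUSED",
    "message": "API server unreachable",
    "retryable": True,
})
```

Because the status is an explicit field, the check for failure is a single comparison rather than a search through output text.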

Stage 2: Classification

Not all errors are the same. A network timeout is usually transient; an authentication failure is not. A rate limit clears with time; a missing file stays missing. The agent's recovery strategy depends on what kind of error it is.

Facio's structured error responses include a classification the agent can reason about:

  • retryable: true — the operation might succeed if retried. Rate limits, transient network errors, temporary unavailability.
  • retryable: false — the operation will not succeed by retrying. Bad input, missing permissions, invalid configuration.
  • recoverable: true — a fallback action could produce a useful result. Use a different MCP server, a different model, a different approach.
  • recoverable: false — the operation has failed and no automated fallback exists. Escalation is required.
  • severity: warning | error | critical — the urgency of the situation. Warnings can wait; errors need attention; criticals need human eyes.

The classification is the agent's input to the recovery decision. A retryable error suggests a retry. A non-retryable error suggests a different approach. A critical error suggests escalation.
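The mapping from classification to decision can be sketched as a small function. This is an illustration under assumptions: the field names come from the list above, but the `Action` enum, the `choose_action` name, and the precedence order (critical severity first, then retryable, then recoverable) are choices made for this example rather than documented Facio behavior.

```python
from enum import Enum
from typing import Any, Mapping


class Action(Enum):
    RETRY = "retry"
    FALLBACK = "fallback"
    ESCALATE = "escalate"


def choose_action(error: Mapping[str, Any]) -> Action:
    """Map an error's classification fields to a recovery action."""
    # Assumed precedence: critical errors go to a human regardless of flags.
    if error.get("severity") == "critical":
        return Action.ESCALATE
    if error.get("retryable", False):
        return Action.RETRY
    if error.get("recoverable", False):
        return Action.FALLBACK
    # Neither retryable nor recoverable: no automated path remains.
    return Action.ESCALATE


decision = choose_action({"retryable": False, "recoverable": True, "severity": "error"})
```

Missing fields default to `False`, so an unclassified error falls through to escalation, which is the conservative choice.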

Stage 3: Attempt Fix

For retryable and recoverable errors, the agent attempts a fix before escalating. The fix can take several forms:

Retry with backoff. For transient errors, the agent waits briefly and retries. A rate-limit response calls for a longer wait before the next attempt; after a network timeout, a retry following a short pause often succeeds because the underlying issue has already cleared.

Fallback to alternative. For recoverable errors, the agent tries a different tool, model, or MCP server. The weather API is down? Try the backup weather MCP. The flagship model is rate-limited? Switch to a cheaper model for this iteration.

Adjust parameters. For parameter-related errors, the agent adjusts the inputs. A file too large? Read in chunks. A query too complex? Decompose into smaller queries. A command too long? Split it.

Self-repair. For configuration errors, the agent can fix the configuration. A wrong path? Find the correct path with glob or grep. A missing dependency? Install it via exec. A wrong model name? Look up the correct name.

The fix attempt is bounded. The iteration budget prevents infinite retry loops. If the fix doesn't work within a few attempts, the agent moves to escalation rather than burning the budget.
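A bounded retry with exponential backoff might look like the following sketch. It assumes tool results are plain dicts with the `status` and `retryable` fields shown earlier; `retry_with_backoff`, its parameters, and the default of three attempts are hypothetical and stand in for whatever limit the iteration budget imposes.

```python
import time
from typing import Any, Callable, Dict

Result = Dict[str, Any]


def retry_with_backoff(
    call: Callable[[], Result],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Result:
    """Call `call` until it succeeds, the error is non-retryable,
    or `max_attempts` is reached. Returns the last result seen."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    result: Result = {}
    for attempt in range(max_attempts):
        result = call()
        if result.get("status") == "success":
            return result
        if not result.get("retryable", False):
            return result  # retrying cannot help; the caller decides what next
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # delay doubles after each failure
    return result  # attempts exhausted; the caller should escalate
```

Passing `sleep` as a parameter keeps the helper testable without real delays. When the function returns a failed result, the caller moves on to a fallback or to escalation instead of looping again.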

Stage 4: Escalate

When the agent can't fix the issue autonomously, it escalates — but not with a generic "something went wrong." It escalates with a clear diagnosis:

ask_approval(
    title="Database migration failed: need human input",
    description="""The migration to add the `customer_segment` column failed with:
    
    Error: ALTER TABLE customers ADD COLUMN customer_segment VARCHAR(50)
    Detail: column 'customer_segment' already exists
    
    Possible next steps:
    1. Drop the existing column and re-run the migration (data loss risk)
    2. Use ALTER TABLE ... ADD COLUMN IF NOT EXISTS (idempotent fix)
    3. Investigate why the column already exists before proceeding
    
    I recommend option 2 — it accomplishes the goal without data loss.
    Approve to proceed with option 2, or reply with alternative instructions.""",
    options=[
        {"id": "option_2", "label": "Use IF NOT EXISTS (recommended)"},
        {"id": "option_1", "label": "Drop and re-run"},
        {"id": "option_3", "label": "Investigate first"},
        {"id": "abort", "label": "Abort migration"}
    ]
)

The escalation gives the human:

  • What happened — the specific error and context
  • Why it happened — the agent's diagnosis
  • What the agent recommends — the suggested next step
  • Alternatives — the human can pick a different path

This is the opposite of "the agent is stuck, please help." It's "the agent has done its analysis, here are the options, please decide."
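One way to keep escalations consistent is to build them from a fixed structure so that no part can be skipped. The sketch below assumes the request shape (title, description, options) seen in the `ask_approval` example; the `Escalation` class, its field names, and the generated option ids are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Escalation:
    """The parts of an escalation: error, diagnosis, recommendation, alternatives."""
    what_happened: str
    diagnosis: str
    recommendation: str
    alternatives: List[str] = field(default_factory=list)

    def to_request(self, title: str) -> Dict[str, Any]:
        """Render the escalation as a title, description, and option list."""
        if not self.what_happened or not self.diagnosis:
            raise ValueError("an escalation needs both the error and a diagnosis")
        description = "\n\n".join([
            f"What happened: {self.what_happened}",
            f"Why: {self.diagnosis}",
            f"Recommendation: {self.recommendation}",
        ])
        # Recommended option first, then alternatives, then an explicit abort.
        labels = [f"{self.recommendation} (recommended)", *self.alternatives, "Abort"]
        options = [
            {"id": f"option_{i}", "label": label}
            for i, label in enumerate(labels, start=1)
        ]
        return {"title": title, "description": description, "options": options}


request = Escalation(
    what_happened="ALTER TABLE failed: column 'customer_segment' already exists",
    diagnosis="The column is already present, so the migration is not idempotent",
    recommendation="Use ADD COLUMN IF NOT EXISTS",
    alternatives=["Drop the column and re-run"],
).to_request("Database migration failed: need human input")
```

Refusing to build a request without a diagnosis is the point of the structure: a bare "something went wrong" cannot be expressed.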

Stage 5: Learn

Failures are learning opportunities. Facio's inline learning captures the failure and the recovery for future sessions:

edit_file(
    path="memory/MEMORY.md",
    old_text="## Production Deployment\n",
    new_text="## Production Deployment\n- Customer table migrations: use `ADD COLUMN IF NOT EXISTS` for idempotency\n- Database connection: pool size 20 max, 5 idle (avoid connection storm)\n"
)

The lesson is captured in the memory file. The next session starts with the knowledge that migrations should use IF NOT EXISTS. The agent doesn't repeat the same mistake. The learning compounds.

For more complex lessons, the agent updates MEMORY.md with a structured "Lessons Learned" section, or adds a note to the relevant skill's documentation. The memory architecture ensures the failure isn't repeated.
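The memory update itself is a small text operation. The following sketch shows one possible way to add a lesson under a section heading without recording it twice; it works on the file's contents as a string, and `add_lesson` is a hypothetical helper, not a Facio tool.

```python
def add_lesson(memory: str, heading: str, lesson: str) -> str:
    """Return `memory` with `- lesson` inserted directly under `heading`.

    Creates the section at the end if the heading is missing, and
    returns the text unchanged if the lesson is already recorded.
    """
    bullet = f"- {lesson}"
    lines = memory.splitlines()
    if bullet in lines:
        return memory  # already known; avoid duplicate entries
    if heading in lines:
        lines.insert(lines.index(heading) + 1, bullet)
    else:
        if lines and lines[-1] != "":
            lines.append("")  # blank line before the new section
        lines.extend([heading, bullet])
    return "\n".join(lines) + "\n"


memory = "# Memory\n\n## Production Deployment\n- Existing note\n"
updated = add_lesson(
    memory,
    "## Production Deployment",
    "Customer table migrations: use `ADD COLUMN IF NOT EXISTS` for idempotency",
)
```

Making the update idempotent matters here for the same reason it matters in the migration: the agent may hit the same failure in several sessions, and the memory file should not grow a new copy of the lesson each time.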

What Self-Healing Looks Like in Practice

Scenario 1: The Transient Network Error

# Agent calls an external API
web_fetch(url="https://api.example.com/data")
# Response: {"status": "error", "code": "ETIMEDOUT", "retryable": true, "severity": "warning"}

# Agent's recovery:
# 1. Wait 5 seconds
# 2. Retry the request
web_fetch(url="https://api.example.com/data")
# Response: {"status": "success", ...}

Total impact: 5-second delay, no user-visible error. The agent self-heals without intervention.

Scenario 2: The MCP Server Is Down

# Agent tries to use a weather MCP server
# The tool call fails because the server is unresponsive
# Response: {"status": "error", "code": "MCP_SERVER_UNREACHABLE", "recoverable": true}

# Agent's recovery:
# 1. Check the audit trail for the server's recent behavior
read_logs(level="ERROR", since="1h", grep="weather-mcp")
# 2. Identify that the server has been failing for the last 30 minutes
# 3. Disable the failing server
manage_mcp(action="disable", name="weather-mcp")
# 4. Switch to a backup weather source (if configured)
# 5. Continue the workflow with the backup

# 6. Log the incident
edit_file(path="memory/MEMORY.md", ...)  # Note: weather-mcp down as of 2026-06-17

Total impact: 30-second recovery time, no user-visible error, agent learns from the incident.
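The switch to a backup source in steps 4 and 5 can be sketched as an ordered fallback chain. The example assumes each source is a callable returning a result dict; `first_available`, the source names, and the `ALL_SOURCES_FAILED` code are hypothetical and used only for illustration.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Result = Dict[str, Any]
Source = Tuple[str, Callable[[], Result]]


def first_available(sources: List[Source]) -> Tuple[Optional[str], Result, List[str]]:
    """Try each (name, call) pair in order.

    Returns the name and result of the first success, plus the names
    of the sources that failed before it. If every source fails, the
    name is None and the result is a non-recoverable error.
    """
    failed: List[str] = []
    for name, call in sources:
        result = call()
        if result.get("status") == "success":
            return name, result, failed
        failed.append(name)
    # Hypothetical error code: nothing left to fall back to.
    exhausted = {"status": "error", "code": "ALL_SOURCES_FAILED", "recoverable": False}
    return None, exhausted, failed


def weather_mcp() -> Result:
    return {"status": "error", "code": "MCP_SERVER_UNREACHABLE", "recoverable": True}


def backup_weather() -> Result:
    return {"status": "success", "data": {"temp_c": 21}}


used, result, failed = first_available([
    ("weather-mcp", weather_mcp),
    ("backup-weather", backup_weather),
])
```

Returning the list of failed sources gives the agent what it needs for the later steps: which server to disable and what to note in memory.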

Scenario 3: The Permanent Failure

# Agent tries to deploy a configuration that's already deployed
exec(command="kubectl create -f deployment.yaml")
# Response: {"status": "error", "code": "ALREADY_EXISTS", "retryable": false, "recoverable": true}

# Agent's recovery:
# 1. Check the current state of the deployment
exec(command="kubectl get deployment api-server -o yaml")
# 2. Compare to the desired state
# 3. Determine that the deployment is already in the desired state
# 4. Continue the workflow — no action needed

Total impact: 10-second investigation, no user-visible error, agent correctly identifies that the "failure" is actually success.
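The comparison in steps 2 and 3 can be sketched as a subset check: every field in the desired state must appear with the same value in the current state, while extra fields the cluster adds on its own are ignored. The dicts below are simplified stand-ins for parsed manifests, and `matches_desired` is a hypothetical helper.

```python
from typing import Any


def matches_desired(current: Any, desired: Any) -> bool:
    """Return True if `current` contains everything `desired` specifies."""
    if isinstance(desired, dict):
        return isinstance(current, dict) and all(
            key in current and matches_desired(current[key], value)
            for key, value in desired.items()
        )
    if isinstance(desired, list):
        return (
            isinstance(current, list)
            and len(current) == len(desired)
            and all(matches_desired(c, d) for c, d in zip(current, desired))
        )
    return current == desired


desired = {"spec": {"replicas": 3, "image": "api-server:1.4"}}
current = {
    "spec": {"replicas": 3, "image": "api-server:1.4"},
    "status": {"readyReplicas": 3},  # server-populated, not part of the desired state
}
already_deployed = matches_desired(current, desired)
```

If the check passes, the error was a no-op and the workflow continues; if it fails, the difference is a real discrepancy that belongs in an escalation.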

Scenario 4: The Escalation-Required Failure

# Agent tries to delete a production database
exec(command="dropdb production_customers")
# Response: {"status": "error", "code": "PERMISSION_DENIED", "retryable": false, "recoverable": false, "severity": "critical"}

# Agent's recovery:
# 1. Recognize that the operation requires elevated permissions
# 2. Escalate with a clear diagnosis:
ask_approval(
    title="Permission required: delete production database",
    description="""I attempted to drop the production_customers database as part of the data cleanup, but the operation was denied due to insufficient permissions.
    
    I need explicit authorization to:
    - Drop the production_customers database
    - This action is irreversible — all customer data will be deleted
    
    Are you sure you want to proceed?""",
    options=[
        {"id": "confirm", "label": "Yes, proceed with deletion"},
        {"id": "abort", "label": "No, abort the cleanup"}
    ]
)

Total impact: agent pauses safely, human makes the decision, audit trail records the escalation.

The Architectural Properties That Enable Self-Healing

Self-healing isn't a feature Facio added on top. It emerges from architectural properties the runtime provides:

Structured error responses. Tools don't return free-form text on failure. They return typed errors with classification (retryable, recoverable, severity). The agent can reason about errors instead of parsing them.

Bounded retries. The iteration budget prevents infinite retry loops. The agent can attempt recovery within a bounded number of iterations, but it can't burn the entire budget retrying the same failed call.

Multi-tool access. The agent has access to read_logs, manage_mcp, switch_model, and other tools it can use for self-repair. The recovery isn't limited to "retry the same call" — it can change the approach.

HITL escalation. When self-healing isn't possible, the agent has a structured way to ask for human input. The escalation isn't "the agent is stuck" — it's "the agent has done its analysis and needs a specific decision."

Memory persistence. Lessons learned from failures are captured in MEMORY.md. The next session starts with the knowledge. Self-healing compounds over time.

Audit trail. Every failure, every recovery attempt, every escalation is logged. Operators can see patterns, identify systemic issues, and improve the agent's behavior over time.
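As an illustration of the pattern-spotting the audit trail allows, the sketch below counts error events per tool and error code and reports the combinations that recur. The event shape (`tool`, `status`, `code`) and the `recurring_failures` helper are assumptions made for this example; the actual audit log format may differ.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Tuple


def recurring_failures(
    events: Iterable[Dict[str, Any]], threshold: int = 3
) -> Dict[Tuple[str, str], int]:
    """Count error events per (tool, code) and keep those seen
    at least `threshold` times."""
    counts = Counter(
        (event["tool"], event["code"])
        for event in events
        if event.get("status") == "error"
    )
    return {key: n for key, n in counts.items() if n >= threshold}


events = [
    {"tool": "weather-mcp", "status": "error", "code": "MCP_SERVER_UNREACHABLE"},
    {"tool": "weather-mcp", "status": "error", "code": "MCP_SERVER_UNREACHABLE"},
    {"tool": "web_fetch", "status": "error", "code": "ETIMEDOUT"},
    {"tool": "weather-mcp", "status": "error", "code": "MCP_SERVER_UNREACHABLE"},
    {"tool": "web_fetch", "status": "success"},
]
hot_spots = recurring_failures(events)
```

A single timeout is noise; the same server failing three times in a row is a systemic issue worth an operator's attention.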

What Self-Healing Doesn't Solve

Honest limitations:

  • It can't fix architectural problems. If the agent's design is fundamentally wrong — using the wrong tool for the job, misunderstanding the user's intent — self-healing doesn't help. Those problems need human design review.
  • It can't recover from model failures. If the underlying LLM produces consistently bad outputs, the agent's recovery is limited. The fix is model selection, not self-healing.
  • It doesn't make every error recoverable. Some errors are catastrophic. The agent's self-healing can't bring back a deleted database. The right response is escalation, not pretending recovery is possible.
  • It doesn't replace testing. Self-healing is for the unexpected. Known failure modes should be tested, not left to runtime recovery.

Bottom Line

Production AI agents will fail. The question is whether they fail like prototypes — returning errors and stopping — or like production systems — detecting, classifying, recovering, and learning.

Facio's self-healing architecture makes production-style failure the default. The runtime detects failures. The error structure classifies them. The agent attempts fixes. The iteration budget bounds the recovery attempts. The HITL tools enable clean escalation. The memory system captures the lessons. The audit trail records everything.

The result: an agent that fails like a senior engineer — recovers what it can, escalates what it can't, and gets smarter with every incident.

The goal is a production agent that fails well. The failures are guaranteed; the recovery is the design.


See the self-healing documentation for error response schemas, recovery pattern libraries, and escalation templates.
