Facio's Graceful Degradation: How AI Agents Keep Working When the Stack Around Them Breaks
Production AI agents operate in hostile environments. The MCP server that worked yesterday returns 503s today. The API that responded in 200ms is now timing out at 30 seconds. The database connection drops mid-query. The network blip terminates the agent's file upload. The integration that was reliable last week fails today.
The agent that fails on the first hiccup is a prototype. The agent that gracefully degrades — falls back to alternatives, retries intelligently, escalates when needed, and continues with what it has — is a production system.
Facio's graceful degradation patterns are the structural discipline that keeps agents working when the world around them isn't. The patterns aren't a single feature; they're a set of techniques the agent applies when things go wrong: retry with backoff, fallback to alternatives, partial completion, circuit breaking, and escalation, each recorded through structured logging. Combined, they turn a brittle prototype into a resilient production agent.
Here's how the patterns work, when to apply each, and why degradation discipline is what separates agents that survive contact with reality from agents that don't.
The Hostile Environment Reality
Production environments are hostile. The agent's dependencies — APIs, databases, MCP servers, file systems, networks — are all potential failure points. The agent's job is to deliver value despite these failures.
The failure modes are predictable:
- Transient failures. Network blips, momentary server overload, brief network partitions. These resolve in seconds or minutes. The retry pattern handles them.
- Resource exhaustion. API rate limits, connection pool exhaustion, disk full, memory pressure. These require waiting, scaling, or alternative resources.
- Service degradation. Slow responses, partial outages, intermittent errors. These require fallback or degraded functionality.
- Hard failures. Server down, authentication revoked, schema changed, configuration lost. These require escalation or alternative approaches.
- Cascading failures. One service fails, and the services downstream of it fail as a result. These require circuit breakers and isolation.
A naive agent treats every failure the same way: error, abort, return to user. The user gets a non-functional experience, has to retry, and loses confidence in the agent.
A graceful agent treats each failure with the appropriate response. The result is an agent that the user trusts because it works even when the underlying systems don't.
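Choosing the appropriate response starts with classifying the failure. The sketch below is illustrative only: `FailureMode` and `classify_failure` are hypothetical names, not part of Facio's API, and the mapping from status codes to modes is an assumption that a real deployment would tune per service. Cascading failures are left out because they can't be recognized from a single response.

```python
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    TRANSIENT = "transient"
    RESOURCE_EXHAUSTION = "resource_exhaustion"
    SERVICE_DEGRADATION = "service_degradation"
    HARD = "hard"


def classify_failure(status: Optional[int] = None,
                     timed_out: bool = False,
                     connection_refused: bool = False) -> FailureMode:
    """Map a raw error signal to a failure mode (assumed, illustrative mapping)."""
    if status == 429:
        return FailureMode.RESOURCE_EXHAUSTION   # rate limit hit
    if connection_refused or status in (401, 403, 404, 410):
        return FailureMode.HARD                  # server down, auth revoked, resource gone
    if timed_out:
        return FailureMode.SERVICE_DEGRADATION   # slow or partially available service
    if status in (502, 503, 504):
        return FailureMode.TRANSIENT             # usually clears within seconds
    return FailureMode.HARD                      # unknown errors: don't retry blindly
```

Treating unknown errors as hard failures is a deliberately conservative default: it sends them toward fallback or escalation rather than into a retry loop.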
The Five Graceful Degradation Patterns
Facio's graceful degradation discipline has five patterns. Each addresses a specific failure mode.
Pattern 1: Retry with Exponential Backoff
For transient failures, the agent retries with increasing delays between attempts:
# First attempt
web_fetch(url="https://api.example.com/data")
# Response: 503 Service Unavailable (transient)
# Wait 1 second
sleep(1)
# Second attempt
web_fetch(url="https://api.example.com/data")
# Response: 503 Service Unavailable (transient)
# Wait 2 seconds (exponential)
sleep(2)
# Third attempt
web_fetch(url="https://api.example.com/data")
# Response: 200 OK (recovered)
The retry pattern has three parameters:
- Max retries. Default 3. More than 5 wastes time on genuinely broken services.
- Base delay. Default 1 second. Longer for rate-limited services, shorter for fast-recovering services.
- Jitter. Add randomness so that many clients do not retry at the same instant (the thundering herd problem). A 10-20% random offset on the computed delay.
The retry pattern works for transient failures and short resource exhaustion. It doesn't work for hard failures or persistent service degradation.
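The three parameters can be combined into a small helper. This is a minimal sketch rather than Facio's implementation: `retry_with_backoff` and `TransientError` are hypothetical names, and the sketch assumes `max_retries` counts retries after the first attempt. The `sleep` argument is injectable so the helper can be exercised without real waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (for example, a 503)."""


def retry_with_backoff(call: Callable[[], T],
                       max_retries: int = 3,
                       base_delay: float = 1.0,
                       jitter: float = 0.2,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying transient failures with exponential backoff."""
    attempt = 0
    while True:
        try:
            return call()
        except TransientError:
            if attempt >= max_retries:
                raise  # retries exhausted: let the caller fall back or escalate
            delay = base_delay * (2 ** attempt)         # 1s, 2s, 4s, ...
            delay += delay * random.uniform(0, jitter)  # add up to 20% jitter
            sleep(delay)
            attempt += 1
```

Only `TransientError` is retried; any other exception propagates immediately, which keeps hard failures out of the retry loop.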
Pattern 2: Fallback to Alternative
For degraded or hard failures, the agent falls back to a configured alternative:
# Primary: Slack MCP server (down)
# Fallback: Email via SMTP
# Try primary first
slack.send_message(channel="#alerts", text="...")
# Response: MCP server connection refused
# Fall back to email
exec(command="curl --ssl smtps://smtp.gmail.com --mail-from alerts@example.com --mail-rcpt team@example.com --upload-file - <<< 'Alert: ...'")
# Response: 250 Message accepted
The fallback pattern requires the agent to know:
- What alternative to use. Configured in the agent's setup or learned from experience.
- When to use the alternative. Triggered by specific failure modes (connection refused, timeout, auth failure).
- How to format for the alternative. Different channels have different formats; the agent adapts.
The fallback pattern is most effective when the alternative is genuinely equivalent (sending a message via Slack vs. email) and least effective when the alternative is itself a degraded substitute (for example, a slower API that still works).
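A fallback chain can be expressed as an ordered list of channels that are tried in turn. The following is a sketch under assumptions: `send_with_fallback` and `ChannelUnavailable` are hypothetical names, and each sender is any callable that raises `ChannelUnavailable` when its channel cannot be reached.

```python
from typing import Callable, List, Sequence, Tuple


class ChannelUnavailable(Exception):
    """Hypothetical error raised when a channel cannot be reached."""


Sender = Callable[[str], None]


def send_with_fallback(text: str,
                       channels: Sequence[Tuple[str, Sender]]) -> str:
    """Try each (name, sender) in order; return the name of the one that worked."""
    errors: List[str] = []
    for name, send in channels:
        try:
            send(text)
            return name
        except ChannelUnavailable as exc:
            errors.append(f"{name}: {exc}")  # remember why this channel failed
    raise ChannelUnavailable("all channels failed: " + "; ".join(errors))
```

Returning the name of the channel that succeeded lets the agent tell the user which path the message actually took.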
Pattern 3: Partial Completion
When a workflow has independent steps, the agent completes what it can and reports what it couldn't:
# Multi-step task: deploy + run tests + notify
# Step 1: Deploy (succeeds)
exec(command="kubectl apply -f deployment.yaml")
# Result: deployment configured
# Step 2: Run tests (fails)
exec(command="npm test")
# Result: Test environment not available
# Step 3: Notify (succeeds)
message(content="Deployment complete. Tests skipped because the test environment was unavailable.", channel="placet")
# Agent response:
"Deployed to production. Tests could not run because the test environment was unavailable. Notification sent. Want me to retry the tests once the environment is back, or proceed without verification?"
The partial completion pattern requires:
- Identifying independent steps. The agent reasons about which steps depend on which.
- Tracking completion state. The agent records what succeeded and what didn't.
- Reporting honestly. The user gets a clear picture of what was done and what wasn't.
The pattern is most valuable for long workflows where abandoning the entire task on a single failure is wasteful.
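The three requirements above map onto a small workflow runner. This sketch is illustrative and assumes each step declares its dependencies explicitly; `run_workflow`, `WorkflowReport`, and the step tuple layout are hypothetical, not Facio's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class WorkflowReport:
    succeeded: List[str] = field(default_factory=list)
    failed: Dict[str, str] = field(default_factory=dict)  # step name -> error text
    skipped: List[str] = field(default_factory=list)      # blocked by a failed dependency


# A step is (name, names of the steps it depends on, action to run).
Step = Tuple[str, Tuple[str, ...], Callable[[], None]]


def run_workflow(steps: List[Step]) -> WorkflowReport:
    """Run steps in order, skipping any step whose dependencies did not succeed."""
    report = WorkflowReport()
    for name, depends_on, action in steps:
        if any(dep not in report.succeeded for dep in depends_on):
            report.skipped.append(name)
            continue
        try:
            action()
            report.succeeded.append(name)
        except Exception as exc:  # record the failure and keep going
            report.failed[name] = str(exc)
    return report
```

The report separates steps that failed from steps that were skipped, so the final message to the user can say what was done, what broke, and what never ran.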
Pattern 4: Circuit Breaker
For cascading failures, the agent stops calling a known-broken service temporarily:
# Service is failing repeatedly
# After 3 failures in 60 seconds: open the circuit
circuit_breaker_open("mcp-server-x", duration_seconds=300)
# Next 5 minutes: don't try mcp-server-x
# Use fallback (Pattern 2) for any task that needed mcp-server-x
# After 5 minutes: half-open the circuit
# Try one call to mcp-server-x
# If success: close the circuit (back to normal)
# If failure: open the circuit for another 5 minutes
The circuit breaker pattern prevents the agent from:
- Wasting time on a service that is clearly broken
- Adding load to a struggling service that may be recovering
- Cascading failures by relying on broken dependencies
The pattern is essential when the agent calls external services that can become overwhelmed. A naive agent retries forever; a circuit-breaker agent backs off and gives the service room to recover.
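The open, half-open, and closed states can be sketched as a small class. This is a minimal single-threaded illustration, not the runtime's implementation: the `CircuitBreaker` class and its method names are assumptions, with defaults taken from the example above (3 failures in 60 seconds, 5 minutes open). The clock is injectable so the behavior can be checked without waiting.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, window_seconds: float = 60.0,
                 open_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.open_seconds = open_seconds
        self._clock = clock
        self._failures: Deque[float] = deque()   # timestamps of recent failures
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Half-open: once the cool-down has elapsed, let a trial call through.
        return self._clock() - self._opened_at >= self.open_seconds

    def record_success(self) -> None:
        self._failures.clear()
        self._opened_at = None  # close the circuit

    def record_failure(self) -> None:
        now = self._clock()
        if self._opened_at is not None:
            self._opened_at = now  # trial call failed: stay open for another cool-down
            return
        self._failures.append(now)
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()  # forget failures outside the window
        if len(self._failures) >= self.failure_threshold:
            self._opened_at = now  # open the circuit
```

The caller checks `allow_request()` before each call and reports the result with `record_success()` or `record_failure()`; while the circuit is open, the task goes to the fallback instead.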
Pattern 5: Escalation
For hard failures the agent can't handle, the agent escalates to a human:
# Deployment requires database migration
# Migration requires manual approval per company policy
# Agent has been instructed to never auto-approve migrations
ask_approval(
title="Manual approval required: production database migration",
description="""The deployment requires migrating the production database to add the new user_segment column. Per company policy, this requires manual approval.
Estimated downtime: 2 minutes
Rollback plan: Reverse migration script ready
Please review and approve.""",
options=[
{"id": "approve", "label": "Approve and run migration"},
{"id": "defer", "label": "Defer migration to scheduled window"},
{"id": "cancel", "label": "Cancel deployment"}
]
)
The escalation pattern is for failures the agent can't recover from autonomously:
- Policy boundaries (the agent isn't authorized to do this)
- High-stakes decisions (the cost of getting it wrong is too high)
- Missing information (the agent doesn't have what it needs)
- Persistent failures (the retries and fallbacks have all failed)
The escalation includes structured context: what was tried, what failed, what the human needs to decide. The human can respond without re-investigating.
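One way to keep that context structured is to assemble the request from the agent's own record of attempts. The helper below is a hypothetical sketch: `build_escalation` is not a Facio function, and it only builds a dictionary shaped like the `title`, `description`, and `options` arguments shown in the example above.

```python
from typing import Dict, List


def build_escalation(title: str, reason: str, attempts: List[str],
                     options: Dict[str, str]) -> dict:
    """Assemble an approval request: why a human is needed, what was tried, the choices."""
    lines = [reason, "", "What was tried:"]
    if attempts:
        lines.extend(f"- {attempt}" for attempt in attempts)
    else:
        lines.append("- nothing yet (escalated before any attempt)")
    return {
        "title": title,
        "description": "\n".join(lines),
        # Options keep the insertion order of the mapping passed in.
        "options": [{"id": key, "label": label} for key, label in options.items()],
    }
```

Listing the attempts in the description is what lets the human decide without re-investigating.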
The Pattern Selection Logic
The agent's choice of pattern depends on the failure type and context:
Transient failure? → Retry with backoff
Persistent failure of one service? → Fallback to alternative
Independent steps, partial success? → Partial completion
Recurring failure across retries? → Circuit breaker
Hard failure, policy boundary, or human decision needed? → Escalation
The selection happens automatically as part of the agent's decision-making. The agent doesn't have to be explicitly told to use each pattern; the patterns are baked into the runtime's error handling.
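The decision table above can be written as an ordered series of checks. This is a sketch under assumptions: `Pattern`, `FailureContext`, and `select_pattern` are hypothetical names, and the precedence (escalation first, then circuit breaker, retry, fallback, partial completion) is one plausible ordering rather than the runtime's documented behavior.

```python
from dataclasses import dataclass
from enum import Enum


class Pattern(Enum):
    RETRY = "retry with backoff"
    FALLBACK = "fallback to alternative"
    PARTIAL_COMPLETION = "partial completion"
    CIRCUIT_BREAKER = "circuit breaker"
    ESCALATION = "escalation"


@dataclass
class FailureContext:
    transient: bool = False                    # likely to clear in seconds or minutes
    recurring: bool = False                    # keeps failing across retries
    has_fallback: bool = False                 # an alternative service is configured
    independent_steps_remaining: bool = False  # other steps can still run
    needs_human: bool = False                  # hard failure, policy boundary, or decision


def select_pattern(ctx: FailureContext) -> Pattern:
    """Pick one pattern; the order of checks is an assumed precedence."""
    if ctx.needs_human:
        return Pattern.ESCALATION
    if ctx.recurring:
        return Pattern.CIRCUIT_BREAKER
    if ctx.transient:
        return Pattern.RETRY
    if ctx.has_fallback:
        return Pattern.FALLBACK
    if ctx.independent_steps_remaining:
        return Pattern.PARTIAL_COMPLETION
    return Pattern.ESCALATION  # nothing else applies: hand off to a human
```

In practice the patterns combine, for example an open circuit breaker routing work to a fallback, so a fuller version might return a sequence of patterns instead of a single one.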
The Degradation Discipline in Practice
Consider a customer onboarding workflow that integrates with five external systems:
# Workflow:
1. Create account in CRM (Salesforce)
2. Provision database (internal)
3. Send welcome email (SendGrid)
4. Create Slack channel (Slack)
5. Schedule kickoff call (Calendly)
If Salesforce is briefly unavailable and Slack is down, what happens?
Naive agent:
# Create account in Salesforce
salesforce.create_account(...)
# Error: Salesforce API timeout
# Return error to user
# Workflow aborted
Facio agent with degradation:
# Try Salesforce
salesforce.create_account(...)
# Error: 503 Service Unavailable
# Apply Pattern 1: Retry with backoff
sleep(2)
salesforce.create_account(...)
# Error: 503 Service Unavailable
sleep(4)
salesforce.create_account(...)
# Success: account created
# Continue workflow
# Provision database (succeeds)
# Send welcome email (succeeds)
# Create Slack channel (fails repeatedly - circuit breaker opens)
circuit_breaker_open("slack", duration_seconds=300)
# Fall back to email for channel creation announcement
send_email(to="team@example.com", subject="New customer channel pending", body="...")
# Schedule kickoff call (succeeds via Calendly API)
# Final report to user:
"Customer onboarded. Salesforce account created (after retry). Database provisioned. Welcome email sent. Slack channel creation deferred (Slack was down); email sent to team instead. Kickoff call scheduled. Want me to retry Slack channel creation in 5 minutes?"
The user gets value despite the partial failures. The agent's discipline turns a brittle workflow into a resilient one.
The Degradation Audit Trail
Every degradation event is logged. The audit trail captures:
- What failed. The tool call, the error response, the failure mode classification.
- What pattern was applied. Retry, fallback, partial completion, circuit breaker, escalation.
- What the outcome was. Recovery, fallback success, partial completion, or escalation outcome.
- How long it took. The wall-clock time from failure to resolution.
The audit trail enables:
- Pattern analysis. Which failures happen most? Which patterns work best? Which services are flaky?
- Capacity planning. When the circuit breaker opens frequently for a service, that's a signal the service is unreliable and needs attention.
- User communication. The user can see what happened and why, not just "the workflow partially succeeded."
- Improvement targets. The patterns that fail most often point to systemic issues worth fixing.
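The four captured fields fit naturally into one structured log record per event. The sketch below assumes a JSON-lines format written through Python's standard `logging` module; `DegradationEvent`, `log_event`, and `failures_by_tool` are hypothetical names, and the real audit schema may differ.

```python
import json
import logging
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass
class DegradationEvent:
    tool_call: str           # what failed
    error: str               # the error response
    failure_mode: str        # the failure mode classification
    pattern: str             # which pattern was applied
    outcome: str             # recovery, fallback success, partial completion, ...
    duration_seconds: float  # wall-clock time from failure to resolution


def log_event(event: DegradationEvent, logger: logging.Logger) -> str:
    """Emit one JSON line per degradation event and return the line."""
    line = json.dumps(asdict(event), sort_keys=True)
    logger.info(line)
    return line


def failures_by_tool(events: Iterable[DegradationEvent]) -> Counter:
    """Count events per tool call to surface flaky services."""
    return Counter(event.tool_call for event in events)
```

Because each record is a single JSON object, the pattern analysis described above reduces to filtering and counting log lines.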
What Graceful Degradation Doesn't Do
Honest limitations:
- It doesn't make the agent more capable. The agent can do what it was designed to do; degradation lets it do that even when the environment fails. It doesn't extend the agent's reach into new capabilities.
- It doesn't fix the underlying problem. The agent works around the failure; the failure itself still needs addressing (by humans, on a different timeline).
- It doesn't eliminate user intervention. Some failures require user input. The degradation patterns reduce these but can't eliminate them.
- It can mask problems. If the agent is too good at working around failures, the underlying service issues might not get the attention they need. The audit trail is the countermeasure.
- It adds latency. Retries, fallbacks, and circuit breakers all add time. The discipline is about trading latency for reliability.
The Compound Effect of Degradation Discipline
Graceful degradation compounds. The agent with discipline:
- Completes more workflows. Partial completion + fallback + retry means more tasks finish successfully.
- Builds user trust. The user sees the agent working through problems, not giving up at the first hiccup.
- Reduces operational burden. The on-call team gets fewer pages because the agent handles most issues autonomously.
- Improves the system. The audit trail identifies systemic issues; the team fixes them; the patterns become less necessary over time.
The agent without discipline has the opposite trajectory. Workflows fail frequently. Users lose trust. Operational burden increases. The system never improves because the failures aren't visible.
Bottom Line
Production AI agents operate in hostile environments. Failures happen. The question isn't whether — it's how the agent responds.
Facio's graceful degradation patterns give agents a structured response: retry for transient failures, fallback for degraded services, partial completion for independent steps, circuit breaker for cascading failures, and escalation for hard limits. The patterns work together to keep the agent productive despite the environment's failures.
The agent without degradation discipline is brittle. The agent with discipline is resilient. The user trusts the resilient one. The operations team prefers the resilient one. The architecture rewards the resilient one.
Because production agents aren't measured by how they work when everything is fine. They're measured by how they work when everything isn't. The degradation discipline is the difference.
See the graceful degradation documentation for pattern configuration, circuit breaker tuning, and fallback chain specifications.