HITL for Production Incidents: When the Agent Becomes the On-Call Rotation
At 3am on a Saturday, your agent detects an anomaly. CPU spikes on three production servers. Latency is degrading. Customer-facing requests are timing out. The agent proposes a remediation: restart the affected services. The action requires human approval.
The on-call engineer is asleep. The secondary on-call is at a wedding. The escalation chain leads to a manager who is on a flight with no WiFi. The action has a 15-minute rollback window. The clock is ticking.
This is HITL for production incidents. The agent is the first responder. The human is the gate. The window is closing. The stakes are the highest they ever get — and the reviewer is at their worst.
Most HITL designs don't account for the incident context. They treat the 3am production incident the same as the 10am routine approval. The reviewer is expected to make a careful decision in a context designed to defeat careful decisions. The system is optimized for the steady state, not the failure state.
This post covers how to design HITL for production incidents — the context that makes incident review different from routine review, the patterns that keep oversight genuine under pressure, and the architecture that survives the 3am scenario.
What Makes Incident HITL Different
Incident HITL differs from routine HITL along five dimensions:
Dimension 1: The Reviewer Is Impaired
At 3am on a Saturday, the on-call reviewer is not at their best. They were woken up. They may have been drinking. They are cognitively impaired. The patterns that work for routine review — careful reasoning, contextual evaluation, structured thinking — break down under sleep deprivation and interruption.
The incident reviewer is also under time pressure. The rollback window is closing. The action needs approval now. The reviewer doesn't have the luxury of careful deliberation.
The system must account for this. The reviewer can't be expected to make a careful decision in conditions designed to defeat careful decisions.
Dimension 2: The Stakes Are Asymmetric
Routine HITL failures cause local harm. A wrong refund is processed. A wrong email is sent. The harm is recoverable.
Incident HITL failures cause systemic harm. A wrong production remediation can take down the entire system. A wrong database migration can corrupt customer data. A wrong security response can lock out legitimate users. The harm may be irreversible.
The asymmetry means the cost of a wrong decision is much higher. The system needs stronger guardrails for incident HITL than for routine HITL.
Dimension 3: The Action Space Is Unfamiliar
Routine HITL reviews familiar action types. The reviewer has seen hundreds of similar actions. The pattern is recognized. The decision is informed by the pattern.
Incident HITL reviews unfamiliar action types. The remediation may be a new playbook. The action may be outside the reviewer's normal domain. The reviewer is being asked to evaluate something they don't fully understand.
The unfamiliarity means the reviewer's pattern recognition doesn't help. The reviewer must reason from first principles, which is exactly what an impaired reviewer struggles to do.
Dimension 4: The Data Volume Is Overwhelming
Routine HITL reviews single actions with limited context. The reviewer can hold the relevant context in their head.
Incident HITL reviews actions in the context of an ongoing incident. The dashboards are full of alerts. The logs are streaming. The customer reports are coming in. The reviewer is asked to make a decision while the situation is evolving.
The data volume means the reviewer can't see the full picture. The decision is made with the data the system happens to surface — not the data the reviewer needs.
Dimension 5: The Time Pressure Is Real
Routine HITL has time. The reviewer can take 30 seconds or 2 minutes. The action is queued. The customer waits. The decision is unhurried.
Incident HITL has no time. The action must be approved now or the rollback window closes. The system is degrading. The customer is being impacted. The reviewer is making a decision under pressure.
The time pressure means the reviewer's reasoning is abbreviated. The decision is made faster than the situation deserves.
The Incident HITL Design Patterns
The patterns that make incident HITL work address each of the five differences:
Pattern 1: Pre-Approved Playbook Actions
For well-understood incident playbooks, the actions are pre-approved. The agent executes the playbook action without waiting for human review. The human is informed, but not asked.
The pre-approved pattern works for actions that:
- Have been executed hundreds of times in similar incidents
- Have a well-understood blast radius
- Have a tested rollback
- Have a clear set of triggering conditions
A restart of a known-failing service is pre-approved. A database failover to a tested replica is pre-approved. A rollback to a known-good deployment is pre-approved. The reviewer doesn't have to be woken up for actions that the system has been designed to handle autonomously.
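The four criteria above can be checked mechanically before the agent acts. The following is a minimal sketch, not a reference implementation: the `PlaybookEntry` fields, the alert names, and the `MIN_PRIOR_EXECUTIONS` threshold are all hypothetical stand-ins for whatever the team's catalog actually records.

```python
from dataclasses import dataclass

# Hypothetical threshold standing in for "executed hundreds of times".
MIN_PRIOR_EXECUTIONS = 100

@dataclass(frozen=True)
class PlaybookEntry:
    action: str                    # e.g. "restart_service"
    prior_executions: int          # times run in similar incidents
    blast_radius_known: bool       # blast radius is well understood
    rollback_tested: bool          # rollback has been exercised
    trigger_conditions: frozenset  # alerts that must all be firing

def is_pre_approved(entry: PlaybookEntry, firing_alerts: set) -> bool:
    """Return True only if every pre-approval criterion holds."""
    return (
        entry.prior_executions >= MIN_PRIOR_EXECUTIONS
        and entry.blast_radius_known
        and entry.rollback_tested
        and len(entry.trigger_conditions) > 0
        # Every triggering condition must be present in the live alerts.
        and entry.trigger_conditions <= firing_alerts
    )

restart = PlaybookEntry(
    action="restart_service",
    prior_executions=340,
    blast_radius_known=True,
    rollback_tested=True,
    trigger_conditions=frozenset({"cpu_high", "latency_degraded"}),
)
```

An entry with no triggering conditions is never pre-approved in this sketch, so an incomplete catalog entry fails closed and falls back to human review.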
Pattern 2: Reduced Review for Tier 1 Actions
For actions that need review but are reversible, the review is reduced. The reviewer approves a class of actions, not an individual action. The class is approved in advance, with specific conditions.
A reviewer can pre-approve "restart services with rolling deployment" during an incident, without approving each individual restart. The class of actions is approved. The individual actions execute under the class approval.
The reduced review pattern works for actions that:
- Are reversible within the rollback window
- Have predictable effects
- Have been executed under similar conditions
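One way to represent a class approval is as a record that individual actions are checked against. This sketch assumes two safeguards the text does not specify, an expiry time and a cap on the number of executions; both are illustrative, as are all the field names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class ClassApproval:
    action_class: str     # e.g. "rolling_restart"
    incident_id: str      # approval is scoped to one incident
    approved_by: str
    expires_at: datetime  # assumed safeguard: approval lapses
    max_executions: int   # assumed safeguard: cap on covered actions

def covered_by(approval: ClassApproval, action_class: str,
               incident_id: str, used: int, now: datetime) -> bool:
    """True if one more action of this class may run under the approval."""
    return (
        approval.action_class == action_class
        and approval.incident_id == incident_id
        and now < approval.expires_at
        and used < approval.max_executions
    )

granted = datetime(2026, 1, 10, 3, 5)
approval = ClassApproval("rolling_restart", "INC-1", "oncall-primary",
                         expires_at=granted + timedelta(minutes=30),
                         max_executions=5)
```

Anything that falls outside the approved class, the incident, the time window, or the execution cap goes back to the reviewer as an individual request.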
Pattern 3: Two-Reviewer for Tier 3 Actions
For high-stakes actions that need approval, the two-reviewer pattern is used. The on-call reviewer and the secondary reviewer must both approve. The action only executes when both approve.
The two-reviewer pattern works for actions that:
- Have irreversible consequences
- Have large blast radius
- Have unclear effects
- Are outside the routine playbook
The two-reviewer pattern adds latency, but the stakes justify it. The system waits for two approvals. Triggering multiple escalation chains in parallel makes a second reviewer easier to find. The decision is more likely to be correct because two reviewers must agree.
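The approval gate itself is small. The sketch below assumes that any two distinct reviewers satisfy the requirement and that a single rejection blocks the action; a stricter version could require specific roles (primary and secondary). The function name and vote strings are hypothetical.

```python
def two_reviewer_decision(votes: dict[str, str], required: int = 2) -> str:
    """Resolve a two-reviewer gate.

    `votes` maps reviewer id -> "approve" or "reject".
    Any rejection blocks the action (an assumption of this sketch);
    otherwise `required` distinct approvers are needed.
    """
    if any(vote == "reject" for vote in votes.values()):
        return "rejected"
    approvers = {reviewer for reviewer, vote in votes.items()
                 if vote == "approve"}
    return "approved" if len(approvers) >= required else "pending"
```

Because votes are keyed by reviewer id, one reviewer approving twice still counts as a single approval.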
Pattern 4: Context-Preserving Review Interface
The incident review interface presents the action in the context of the incident. The reviewer sees:
- The current incident state (alerts, dashboards, customer impact)
- The proposed action and its expected effect
- The rollback procedure if the action fails
- The history of similar actions in similar incidents
- The agent's reasoning for why this action is the right one now
The interface is designed for the impaired reviewer. The information is structured for quick comprehension. The critical details are at the top. The irrelevant details are hidden.
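As an illustration of "critical details at the top", the review payload can be rendered in a fixed order with the history truncated. This is only a sketch: the field names, the labels, and the three-item history limit are assumptions, not a description of any particular interface.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewPayload:
    proposed_action: str
    customer_impact: str
    expected_effect: str
    rollback_procedure: str
    agent_reasoning: str
    similar_past_actions: list = field(default_factory=list)

def render_summary(p: ReviewPayload, max_history: int = 3) -> str:
    """Render the fields in a fixed order, most critical first."""
    lines = [
        f"ACTION:   {p.proposed_action}",
        f"IMPACT:   {p.customer_impact}",
        f"EFFECT:   {p.expected_effect}",
        f"ROLLBACK: {p.rollback_procedure}",
        f"WHY NOW:  {p.agent_reasoning}",
    ]
    # Show only the most relevant history; the rest stays hidden.
    lines += [f"HISTORY:  {h}" for h in p.similar_past_actions[:max_history]]
    return "\n".join(lines)
```

A fixed order means the reviewer always finds the action and the rollback in the same place, which is what the interface drills later in this post rely on.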
Pattern 5: Slow-Down Friction for Critical Actions
For the highest-stakes actions, the interface adds friction. The reviewer cannot approve in less than 60 seconds. The reviewer must confirm the rollback procedure. The reviewer must type a confirmation phrase.
The friction is deliberate. The incident context encourages fast decisions. The friction slows the reviewer down. The slowing is what makes the decision a decision, not a rubber stamp.
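The three friction checks can be enforced server-side so that a hurried client cannot skip them. A minimal sketch, assuming timestamps in seconds and a hypothetical `validate_approval` helper that returns the list of unmet requirements:

```python
MIN_REVIEW_SECONDS = 60  # the 60-second minimum described above

def validate_approval(opened_at: float, submitted_at: float,
                      rollback_confirmed: bool,
                      typed_phrase: str, expected_phrase: str) -> list[str]:
    """Return the unmet friction requirements; an empty list means valid."""
    problems = []
    if submitted_at - opened_at < MIN_REVIEW_SECONDS:
        problems.append("minimum review time not reached")
    if not rollback_confirmed:
        problems.append("rollback procedure not confirmed")
    if typed_phrase.strip() != expected_phrase:
        problems.append("confirmation phrase does not match")
    return problems
```

Returning every unmet requirement at once, rather than failing on the first, tells a tired reviewer exactly what is still missing.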
Pattern 6: Hold and Escalate on Timeout
If the reviewer doesn't respond within the incident window, the action does not execute. The action is held. The escalation continues. If part of the remediation has already run, its rollback procedure is initiated.
The default for incident HITL is not auto-approve on timeout. The default is hold and escalate. The cost of holding is the customer impact of the unresolved incident. The cost of auto-approving a wrong action is the customer impact of the wrong remediation, on top of the incident that is still unresolved. Holding is the safer default.
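The hold-and-escalate default can be made structural: the function that resolves a pending review simply has no branch that turns a timeout into an execution. The names below (`Outcome`, `resolve`) are hypothetical, and the sketch ignores the modify and escalate decisions for brevity.

```python
from enum import Enum
from typing import Optional

class Outcome(Enum):
    EXECUTE = "execute"
    REJECT = "reject"
    WAITING = "waiting"
    HOLD_AND_ESCALATE = "hold_and_escalate"

def resolve(decision: Optional[str], waited_seconds: float,
            window_seconds: float) -> Outcome:
    """Resolve a pending review. Only an explicit approval executes."""
    if decision == "approve":
        return Outcome.EXECUTE
    if decision == "reject":
        return Outcome.REJECT
    if waited_seconds >= window_seconds:
        # Timeout: the action stays held and the next reviewer is paged.
        return Outcome.HOLD_AND_ESCALATE
    return Outcome.WAITING
```

With no decision recorded, the only possible results are `WAITING` and `HOLD_AND_ESCALATE`, regardless of how long the review has been open.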
Pattern 7: Post-Incident Reviewer Calibration
After the incident, the reviewers who made the decisions are debriefed. The debrief assesses decision quality, context adequacy, and how well the system supported the reviewer. The findings are used to improve the response to the next incident.
The debrief is blameless. The reviewer was impaired, time-pressured, and working with imperfect context. The debrief identifies system improvements that would have made the decision easier.
The Pre-Incident Preparation
The patterns above work best when the system is prepared before the incident. The preparation includes:
Preparation 1: Pre-Approved Playbook Catalog
The team defines the playbook actions that are pre-approved for incidents. The catalog is reviewed regularly. The catalog includes the conditions for pre-approval and the rollback procedures.
The catalog reduces the number of actions that need reviewer approval during the incident. The reviewers focus on the actions that aren't pre-approved — the novel, the uncertain, the high-stakes.
Preparation 2: Reviewer Training on Incident Scenarios
The reviewers are trained on incident scenarios. The training simulates the conditions — sleep deprivation, time pressure, alert overload. The training builds the muscle memory for making decisions under pressure.
The training is not optional. Every reviewer who participates in incident response receives the training. The training is refreshed quarterly.
Preparation 3: Interface Drills
The incident review interface is drilled. The reviewers know where to look. The reviewers know what each field means. The reviewers know the friction points (the 60-second minimum, the confirmation phrase).
The drills are quick — 5-minute exercises that walk through the interface. The drills are repeated until the interface is second nature.
Preparation 4: Escalation Chain Testing
The escalation chain is tested. The on-call rotation, the secondary rotation, the manager escalation, the executive escalation — all tested. The testing identifies gaps (the secondary on-call who is unreachable, the manager who is on a flight).
The gaps are fixed. The chain is reliable. The escalation works when needed.
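Escalation testing can be as simple as sending a test page to every level and recording how long each takes to acknowledge. The sketch below flags the gaps; the 300-second acknowledgment target and the level names are assumptions for illustration only.

```python
from typing import Optional

MAX_ACK_SECONDS = 300.0  # hypothetical acknowledgment target

def find_gaps(test_pages: dict[str, Optional[float]],
              max_ack_seconds: float = MAX_ACK_SECONDS) -> list[str]:
    """Return escalation levels that failed a test page.

    `test_pages` maps level name -> seconds until the test page was
    acknowledged, or None if it was never acknowledged.
    """
    return [
        level for level, ack in test_pages.items()
        if ack is None or ack > max_ack_seconds
    ]

results = {"primary": 45.0, "secondary": None,
           "manager": 1200.0, "executive": 180.0}
```

Here `find_gaps(results)` reports the secondary (no acknowledgment) and the manager (too slow), which are the two failure modes from the opening scenario.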
Preparation 5: Post-Incident Review Process
The post-incident review process is defined. The debrief is structured. The blameless review is documented. The improvements are tracked.
The process ensures that the lessons from each incident become system improvements. The incidents become investments in the system's future.
The Incident HITL Architecture
The incident HITL architecture has these components:
┌─────────────────────────────────────────────────────────────┐
│                     Incident Detection                      │
│                (Agent monitors, classifies)                 │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       Playbook Lookup                       │
│                (Match incident to playbook)                 │
└──────────────────────────────┬──────────────────────────────┘
                               │
               ┌───────────────┴───────────────┐
               │                               │
               ▼                               ▼
   ┌───────────────────────┐       ┌───────────────────────┐
   │ Pre-Approved          │       │ Review-Required       │
   │ Playbook Action       │       │ Action                │
   │                       │       │                       │
   │ Execute and log       │       │ Tier 1: Reduced       │
   │ (no human needed)     │       │ Tier 2: Standard      │
   │                       │       │ Tier 3: Two-Reviewer  │
   └───────────────────────┘       └───────────┬───────────┘
                                               │
                                               ▼
                                  ┌─────────────────────────┐
                                  │ Incident Review UI      │
                                  │ - Incident context      │
                                  │ - Action + rollback     │
                                  │ - Friction for tier 3   │
                                  │ - Hold on timeout       │
                                  └────────────┬────────────┘
                                               │
                                               ▼
                                  ┌─────────────────────────┐
                                  │ Execution + Audit       │
                                  │ (Approved action runs)  │
                                  └────────────┬────────────┘
                                               │
                                               ▼
                                  ┌─────────────────────────┐
                                  │ Post-Incident Review    │
                                  │ (Blameless debrief)     │
                                  └─────────────────────────┘
The architecture has two paths from incident detection to action execution: pre-approved playbook actions, which execute and log without a human, and review-required actions, which pass through the incident review UI at tier 1, 2, or 3. The post-incident review closes the loop after execution. The architecture is designed for the impaired reviewer: the interface, the friction, the hold-on-timeout default, and the post-incident learning.
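The routing step in the architecture can be sketched as a single classification function. The tier 1 and tier 3 rules follow the criteria listed under the patterns; the post does not define tier 2, so this sketch treats it as the fallback for anything that is neither, which is an assumption. All type and field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class Route(Enum):
    PRE_APPROVED = "execute and log"
    TIER_1 = "reduced review"
    TIER_2 = "standard review"
    TIER_3 = "two-reviewer"

@dataclass(frozen=True)
class ActionProfile:
    name: str
    reversible: bool
    large_blast_radius: bool
    predictable_effects: bool
    in_playbook: bool
    seen_in_similar_conditions: bool

def route(action: ActionProfile, pre_approved: set) -> Route:
    """Pick the review path for a proposed action."""
    if action.name in pre_approved:
        return Route.PRE_APPROVED
    # Any tier 3 criterion is enough to require two reviewers.
    if (not action.reversible or action.large_blast_radius
            or not action.predictable_effects or not action.in_playbook):
        return Route.TIER_3
    # Reversible and predictable: tier 1 if it has run in similar conditions.
    if action.seen_in_similar_conditions:
        return Route.TIER_1
    return Route.TIER_2  # assumed fallback; not defined in the text
```

Checking the tier 3 criteria before the tier 1 criteria means an ambiguous action is routed to the stricter path.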
The On-Call Reviewer Experience
The on-call reviewer experience has five stages:
Stage 1: Alert and Wake
The reviewer is paged. The page includes the incident summary, the proposed action, the urgency. The reviewer wakes up (or wakes up more). The reviewer's first task is to acknowledge the page.
Stage 2: Context Load
The reviewer opens the incident interface. The interface loads the incident context: the alerts, the dashboards, the recent actions. The reviewer has 30 seconds to absorb the context before the action requires a decision.
Stage 3: Action Review
The reviewer sees the proposed action. The interface shows the action, the reasoning, the expected effect, the rollback procedure. The reviewer has the friction time (60 seconds for tier 3) to evaluate.
Stage 4: Decision and Acknowledgment
The reviewer makes the decision. The decision is approve, reject, modify, or escalate. The reviewer acknowledges the decision with a confirmation. The audit trail records the decision with the incident context.
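An audit record for this stage needs the four decision types plus enough context to reconstruct what the reviewer saw. The following is a sketch under assumed field names; it is not the schema of any specific audit system.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum

class Decision(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    MODIFY = "modify"
    ESCALATE = "escalate"

@dataclass(frozen=True)
class AuditRecord:
    incident_id: str
    action: str
    reviewer: str
    decision: Decision
    decided_at: str         # ISO 8601 timestamp
    context_snapshot: dict  # what the interface showed at decision time

def to_json(record: AuditRecord) -> str:
    """Serialize the record for an append-only audit log."""
    payload = asdict(record)
    payload["decision"] = record.decision.value  # store the plain string
    return json.dumps(payload, sort_keys=True)

record = AuditRecord("INC-1", "restart api-gateway", "oncall-primary",
                     Decision.APPROVE, "2026-01-10T03:12:00+00:00",
                     {"firing_alerts": ["cpu_high", "latency_degraded"]})
```

Storing the context snapshot with the decision lets the post-incident review judge the decision against what the reviewer could actually see, not against what was learned later.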
Stage 5: Post-Incident Review
After the incident, the reviewer participates in the post-incident review. The debrief is blameless. The improvements are tracked. The reviewer is supported.
The experience is designed to make the impaired reviewer's decision as informed as possible. The interface provides the context. The friction prevents rubber stamping. The hold-on-timeout prevents wrong auto-approval. The post-incident review learns from the decision.
The Metrics for Incident HITL
Five metrics show whether incident HITL is working:
Time to Decision
How long from page to decision? Time to decision should be short for tier 1 (reduced review) and longer for tier 3 (two-reviewer). The metric is tracked per reviewer, per action type, and per incident severity.
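A minimal sketch of the computation, grouped by tier and using the median so that one very slow decision does not dominate. The record shape (tier, page time, decision time, in seconds) is an assumption; grouping by reviewer or severity would follow the same pattern.

```python
from collections import defaultdict
from statistics import median

def median_time_to_decision(records):
    """Median seconds from page to decision, per tier.

    `records` is an iterable of (tier, paged_at, decided_at) tuples
    with timestamps in seconds.
    """
    by_tier = defaultdict(list)
    for tier, paged_at, decided_at in records:
        by_tier[tier].append(decided_at - paged_at)
    return {tier: median(durations) for tier, durations in by_tier.items()}

sample = [(1, 0, 40), (1, 100, 160), (3, 0, 300)]
```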
Decision Quality
How often were the incident decisions correct? The metric is measured by post-incident outcomes (the action resolved the incident, the action didn't, the action created a new incident).
Friction Effectiveness
Did the friction prevent wrong decisions? The metric is the rate of wrong decisions with and without friction. The friction is effective if the wrong-decision rate is lower with friction.
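The comparison reduces to two rates. The sketch below assumes each decision has already been labeled wrong or not in the post-incident review; the function name and data shape are hypothetical.

```python
from typing import Optional

def wrong_rate(decisions, friction: bool) -> Optional[float]:
    """Share of wrong decisions among those made with (or without) friction.

    `decisions` is a list of (had_friction, was_wrong) boolean pairs.
    Returns None when there is no data for the requested group.
    """
    subset = [was_wrong for had_friction, was_wrong in decisions
              if had_friction == friction]
    if not subset:
        return None
    return sum(subset) / len(subset)

decisions = ([(True, False)] * 9 + [(True, True)]
             + [(False, False)] * 3 + [(False, True)])
```

One caveat on interpretation: if friction is applied only to the highest-stakes actions, the two groups contain different kinds of decisions, so a raw difference in rates is suggestive rather than conclusive.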
Escalation Effectiveness
How often did the escalation chain work? The metric is the rate of successful escalation (the next reviewer responded) vs. failed escalation (no response, escalation timed out).
Post-Incident Improvement Rate
How often do the post-incident reviews produce system improvements? The metric combines the number of improvements per incident, the time to implement them, and their impact.
The metrics drive the continuous improvement of the incident HITL system.
Where Facio Fits
Facio's policy engine encodes the pre-approved playbook and the tier classification. The manifest specifies which actions are pre-approved, which need tier 1 review, which need tier 2, which need tier 3. The manifest is reviewed and approved by the on-call leadership.
Placet.io's incident review interface is designed for the impaired reviewer. The incident context is at the top. The action and rollback are highlighted. The friction is calibrated per tier. The hold-on-timeout is the default.
The audit trail captures the incident decisions. The reviewer, the context, the decision, the outcome. The post-incident review uses the audit trail to understand what happened and why.
The continuous calibration detects reviewer drift and incident system drift. The patterns in the decisions drive the playbook updates, the reviewer training, the interface improvements.
The 3am scenario is the test of any HITL system. Facio is designed for it.
Key Takeaways
- Incident HITL is the highest-stakes variant: the decision carries the most risk exactly when the reviewer is least able to make it
- Five dimensions make it different: impaired reviewer, asymmetric stakes, unfamiliar action, overwhelming data, real time pressure
- Seven patterns address the differences: pre-approved playbooks, reduced review, two-reviewer for critical, context-preserving interface, slow-down friction, hold-on-timeout, post-incident calibration
- Five pre-incident preparations: playbook catalog, reviewer training, interface drills, escalation testing, post-incident process
- The hold-on-timeout default is the safer default — don't auto-approve wrong actions in the incident context
- Post-incident review is blameless — the system improvements come from the incident, not the blame for the decision
- Facio + Placet.io are designed for the 3am scenario — pre-approved playbooks, tier classification, context-preserving interface, friction calibrated per tier, hold-on-timeout default
Sources: The incident HITL design draws on SRE practices (Google SRE book, incident response procedures), the documented patterns of on-call decision-making under pressure, the human factors research on decision-making under sleep deprivation and time pressure, and the production incident response practices of high-reliability organizations during 2025-2026.