Human-in-the-loop · Jun 9, 2026

When AI Agents Talk to Customers: The HITL Architecture Behind Refunds, Cancellations, and Account Changes

A customer-service AI agent just approved a $50,000 refund — to a fraudulent account. The CFO wants answers. Here is the four-tier HITL architecture that lets customer-facing agents move at conversational speed for routine cases and force a synchronous human checkpoint for everything that can hurt the business.

HITL · Customer Service · AI Agents · Approval Workflow · Trust Calibration · Risk Tiers · EU AI Act

Your customer-service AI agent just approved a $50,000 refund — to a fraudulent account. The CFO wants to know how a system made a five-figure financial decision without a human in the loop. Meanwhile, a cancellation agent in another company has just deleted twelve months of usage history for a B2B account because the user said "please cancel everything" and the model complied.

These are not hypotheticals. They are the reason customer-facing HITL is the most architecturally demanding version of human-in-the-loop oversight. Internal agents can fail quietly. Customer-facing agents fail in front of the people who pay you — and in front of the people who defraud you.

This post covers the architecture pattern we recommend for production customer-facing agents: a four-tier risk model, a synchronous-vs-asynchronous choice for each tier, and an explicit override hierarchy that keeps a single human reviewer from becoming a single point of failure. It is the same pattern we built into Facio's runtime and Placet's approval inbox, because it is the pattern that survives contact with real customer traffic.

Why customer-facing HITL is a different animal

Internal HITL — the kind this blog has covered in the maturity model, the synchronous-vs-asynchronous analysis, and the multi-agent coordination post — has one luxury that customer-facing HITL does not: time. A junior accountant can wait eight minutes for a senior accountant to review a journal entry. A customer waiting for a refund cannot.

Customer-facing HITL is constrained by three forces that internal HITL is not:

Latency budgets measured in seconds, not minutes. A customer on a support chat will tolerate a 10-second pause for a small refund approval. They will not tolerate three minutes of "an agent will be with you shortly" while the human reviewer finds the right queue. Blow the latency budget and the customer escalates, usually by phone, defeating the entire point of the agent.
Volume asymmetry. A support agent that handles 80% of cases autonomously and escalates 20% looks like a hero at low volume. At high volume, that same 20% means the human review workload grows in lockstep with traffic, and once escalations arrive faster than reviewers can clear them, the backlog grows without bound. Klarna's AI assistant famously handled two-thirds of customer-service chats in its first month, and then had to rehire human agents when quality and edge cases caught up with the volume. The lesson is not "do not use AI." The lesson is "do not let escalation rates become a fixed cost."
Two-way adversarial pressure. Internal HITL fails because of bugs and edge cases. Customer-facing HITL fails because of adversarial customers — fraudsters, social engineers, and people who have memorized the script. The approval threshold that catches the legitimate "$200 refund for a damaged item" cannot be the same threshold that catches the legitimate-looking "$50,000 refund to a new account with a freshly changed payment method."

These three forces together are why a flat "approve everything over $X" rule does not work. You need a risk-tiered HITL architecture that picks its battles.

The four-tier risk model

Every customer-facing agent action falls into one of four tiers, and the tier — not the action type — determines the HITL treatment. This is the part that most teams get wrong: they tier by what the agent is doing (refund, cancellation, account change) instead of by what could go wrong.

Tier	Definition	Examples	HITL treatment
T1 — Routine	Reversible, low-value, low reputational risk.	Read account status. Send a templated reply. Update a non-financial preference.	Fully autonomous. No approval gate. Async audit log only.
T2 — Recoverable	Reversible within a short window, or low-to-medium financial impact.	Issue a refund under a per-customer cap. Extend a free trial by 7 days. Reset a password.	Asynchronous approval. Agent acts, queues a review ticket, human can roll back within a configurable window.
T3 — Material	Hard to reverse, medium-to-high financial impact, or trust-sensitive.	Refund over the per-customer cap. Account cancellation for a paying customer. Subscription downgrade with proration.	Synchronous approval. Agent pauses, presents a structured approval request, resumes only after a qualified reviewer signs off.
T4 — Catastrophic	Irreversible, large financial impact, or regulatory exposure.	Refunds above the human-discretion threshold. B2B account terminations. Anything touching PCI, GDPR, or a use case classified as high-risk under the EU AI Act, where the Article 14 human-oversight obligations apply.	Synchronous approval, dual control. Two qualified reviewers with separation of duties. The agent cannot proceed without both signatures.

Two design notes that are easy to miss:

Tiers are assigned by policy, not by the agent. The model does not decide "this is a T3." A configuration rule does. This is the same principle as the configuration-not-code post: the moment the agent itself decides whether it needs approval, you have an attacker-controlled approval gate.
The boundary between T2 and T3 is the one that matters. T1 vs. T2 is "do we queue a review?" T3 vs. T4 is "do we require two humans?" The T2/T3 boundary is where latency starts to bite, where override rates start to drift, and where approval fatigue (covered in an earlier post) starts to degrade reviewer judgment. Spend disproportionate calibration effort on this boundary.
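To make the "policy, not agent" point concrete, here is a minimal sketch of a tier classifier driven entirely by configuration. The field names, the thresholds in POLICY, and the assign_tier function are hypothetical illustrations, not a real product API. The assumption is that the Action facts come from trusted system data (order records, account flags), never from the model's own output.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    T1_ROUTINE = 1
    T2_RECOVERABLE = 2
    T3_MATERIAL = 3
    T4_CATASTROPHIC = 4


@dataclass(frozen=True)
class Action:
    kind: str                # e.g. "refund", "cancel_account"
    amount: float = 0.0      # financial impact in account currency
    reversible: bool = True
    regulated: bool = False  # touches PCI, GDPR, or high-risk scope
    b2b: bool = False


# Hypothetical values. In practice these would be loaded from a versioned
# config file, not hardcoded and not placed in the agent prompt.
POLICY = {
    "per_customer_refund_cap": 200.0,
    "human_discretion_threshold": 5_000.0,
}


def assign_tier(action: Action, policy: dict = POLICY) -> Tier:
    """Classify by what could go wrong, checking the worst case first."""
    if action.regulated or (action.b2b and not action.reversible):
        return Tier.T4_CATASTROPHIC
    if action.amount > policy["human_discretion_threshold"]:
        return Tier.T4_CATASTROPHIC
    if not action.reversible or action.amount > policy["per_customer_refund_cap"]:
        return Tier.T3_MATERIAL
    if action.amount > 0 or action.kind in {"extend_trial", "reset_password"}:
        return Tier.T2_RECOVERABLE
    return Tier.T1_ROUTINE
```

Under these assumed thresholds, a $200 refund lands in T2 and a $50,000 refund lands in T4, and in neither case does the agent get a vote. Moving the T2/T3 boundary is a one-line config change that can be reviewed like any other diff.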

The latency-vs-control choice for each tier

For every tier, you have to choose between synchronous (the workflow pauses, the human reviews, the agent resumes) and asynchronous (the agent acts and logs, the human reviews later). The choice is a function of reversibility and blast radius.

Asynchronous works when the action is reversible within a meaningful audit window — say, 24 hours. A T2 refund issued at 02:00 UTC can be clawed back at 04:00 UTC if a reviewer flags it. Asynchronous HITL preserves conversational latency and keeps the human review queue at a manageable size. The cost is that, for the audit window, the wrong decision is in production.

Synchronous is mandatory when the action is irreversible or when the blast radius exceeds the human's authority to roll it back. A T3 cancellation is technically reversible — the customer can re-subscribe — but the trust damage of a wrong cancellation is not. Synchronous approval puts the human in the loop at the cost of latency. The trick is to make the synchronous pause invisible to the customer: a believable "I'm checking this for you" message, a structured review surface with the full context, and a sub-60-second median review time for T3 cases.

The Galileo research on HITL agent oversight makes the point sharply: synchronous HITL should be a policy-enforcement mechanism for specific high-risk actions, not a default operational mode. If everything is synchronous, your customer-facing agent becomes a slow chatbot that happens to have a human in the middle.
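The synchronous-vs-asynchronous choice can be written down as a small decision function over reversibility and blast radius. This is a sketch under stated assumptions: the Risk fields, the rollback_authority concept as a dollar figure, and the autonomous_limit parameter are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ReviewMode(Enum):
    NONE = "autonomous, audit log only"
    ASYNC = "act now, review within the audit window"
    SYNC = "pause until a reviewer signs off"


@dataclass(frozen=True)
class Risk:
    reversible_within_window: bool  # can it be rolled back inside the audit window?
    blast_radius: float             # worst-case loss if the action is wrong
    rollback_authority: float       # largest loss a reviewer is allowed to undo
    trust_sensitive: bool = False   # e.g. cancelling a paying customer


def review_mode(risk: Risk, autonomous_limit: float = 0.0) -> ReviewMode:
    """Pick the review mode from reversibility and blast radius."""
    if not risk.reversible_within_window:
        return ReviewMode.SYNC
    # Technically reversible is not enough if the damage outruns the
    # reviewer's authority, or if the harm is to trust rather than money.
    if risk.blast_radius > risk.rollback_authority or risk.trust_sensitive:
        return ReviewMode.SYNC
    if risk.blast_radius <= autonomous_limit:
        return ReviewMode.NONE
    return ReviewMode.ASYNC
```

A $150 refund that a reviewer can claw back within the window comes out as ASYNC. A cancellation flagged as trust-sensitive comes out as SYNC even though the customer could re-subscribe.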

The override hierarchy

The tier model handles the easy 95% of cases. The hard 5% is what happens when a human reviewer makes the wrong call. This is the part most HITL blog posts skip, and it is the part that determines whether your audit trail will hold up under regulatory scrutiny.

There are three classes of wrong call, and each needs a different override path:

The reviewer approves something they should have rejected. The fraudster gets the $50,000 refund. Override path: a senior reviewer (T4-level, or a fraud analyst) flags the case from the async audit queue, the refund is clawed back, and the case is added to the original reviewer's calibration set.
The reviewer rejects something the customer was entitled to. The legitimate customer is denied a $200 refund for a defective product, escalates, and churns. Override path: a senior reviewer reopens the case from the same async audit queue, the refund is issued, and the original reviewer receives a feedback signal.
The reviewer is unavailable, slow, or wrong too often. This is the case the HITL timeout post covered. Override path: a routing policy that redistributes cases when a reviewer's override rate drifts outside the configured band, or when their median latency exceeds the tier's SLA.
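The routing policy in the third case can be sketched as a filter over reviewer statistics. The ReviewerStats shape, the band and sla_s parameters, and the min_decisions cutoff are assumptions made for this example; a real deployment would pull these from its own review history and tier configuration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class ReviewerStats:
    name: str
    decisions: int            # decisions made in the lookback period
    overridden: int           # how many of those were later overridden
    latencies_s: list[float]  # per-decision review times in seconds

    @property
    def override_rate(self) -> float:
        return self.overridden / self.decisions if self.decisions else 0.0


def eligible_reviewers(reviewers, band, sla_s, min_decisions=20):
    """Return the reviewers who should keep receiving cases for a tier."""
    low, high = band
    keep = []
    for r in reviewers:
        if r.decisions < min_decisions:
            keep.append(r)  # too little data to judge; do not penalize yet
            continue
        in_band = low <= r.override_rate <= high
        within_sla = not r.latencies_s or median(r.latencies_s) <= sla_s
        if in_band and within_sla:
            keep.append(r)
    return keep
```

Reviewers who fall out of the returned list are not punished by this function. Their cases are simply redistributed until a calibration review brings them back inside the band.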

In every case, the override must produce a structured event with five fields: who overrode, who was overridden, what policy rule was applied, what the new decision is, and a free-text reason. That event is the unit of audit. Without it, your HITL system is a black box with a human in the middle.
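A minimal sketch of that five-field event, written as one JSON line per override. The class name, field names, and append_event helper are hypothetical. The timestamp and the rule that a reviewer cannot override their own decision are our additions for the example, not part of the five fields. Opening a file in append mode does not by itself make a log append-only; that guarantee has to come from storage permissions the agent does not hold.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class OverrideEvent:
    overridden_by: str      # who overrode
    original_reviewer: str  # who was overridden
    policy_rule: str        # which policy rule was applied
    new_decision: str       # e.g. "approve", "reject", "clawback"
    reason: str             # free-text justification

    def __post_init__(self):
        # Reject events with blank fields so the audit unit is always complete.
        for name, value in asdict(self).items():
            if not value.strip():
                raise ValueError(f"{name} must not be empty")
        # Assumption for this sketch: self-overrides are not allowed.
        if self.overridden_by == self.original_reviewer:
            raise ValueError("a reviewer cannot override their own decision")


def append_event(path: str, event: OverrideEvent) -> str:
    """Append the event as one JSON line and return the line written."""
    record = {"ts": datetime.now(timezone.utc).isoformat(), **asdict(event)}
    line = json.dumps(record, sort_keys=True)
    with open(path, "a", encoding="utf-8") as log:
        log.write(line + "\n")
    return line
```

Validating at construction time means an override with a blank reason never reaches the log, which is the failure mode that turns "who approved this?" into guesswork six months later.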

The architectural mistake we see most often is conflating the override hierarchy with the approval hierarchy. They are not the same. The approval hierarchy is who is authorized to sign off on a T3 action. The override hierarchy is who is authorized to reverse a decision that has already been signed off on. Conflating them creates a single bottleneck for both new approvals and corrections, which is exactly the kind of thing that causes the approval fatigue failure mode.

Calibration: the metric nobody measures

Every customer-facing HITL system should publish a number that almost nobody publishes: the override rate per tier. This is the percentage of approved decisions in a tier that are subsequently overridden — either by a senior reviewer, by a customer escalation, or by a downstream fraud signal.

The override rate is the single most useful health metric for a customer-facing HITL system, because it tells you three things at once:

Whether your tier boundaries are set correctly. A T2 tier with a 30% override rate is misclassified — those actions are actually T3.
Whether your reviewers are calibrated. A reviewer with a 40% override rate and a 5-second median review time is rubber-stamping. A reviewer with a 0% override rate and a 10-minute median review time is over-cautious and is creating a latency problem you have not noticed yet.
Whether the agent itself is drifting. A sudden jump in the T3 override rate after a model upgrade is a signal that the upgrade changed the action distribution in a way the tier model did not anticipate.
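Computing the metric is simple, which makes it all the more surprising how rarely it is published. The sketch below assumes decisions arrive as (tier, was_overridden) pairs and that each tier has a target rate. The function names and the tolerance multiplier are hypothetical choices for illustration.

```python
from collections import defaultdict


def override_rate_per_tier(decisions):
    """decisions: iterable of (tier, was_overridden) pairs."""
    totals = defaultdict(int)
    overridden = defaultdict(int)
    for tier, was_overridden in decisions:
        totals[tier] += 1
        overridden[tier] += int(was_overridden)
    return {tier: overridden[tier] / totals[tier] for tier in totals}


def misclassified_tiers(rates, targets, tolerance=3.0):
    """Flag tiers whose override rate exceeds tolerance times the target."""
    return sorted(
        tier
        for tier, rate in rates.items()
        if tier in targets and rate > tolerance * targets[tier]
    )
```

Running the same calculation on a window before and after a model upgrade gives a cheap drift check: if the T3 rate jumps while reviewer staffing is unchanged, the action distribution is the likely suspect.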

Galileo's production HITL guide recommends pairing confidence thresholds with escalation-rate monitoring as a system health indicator. The override rate per tier is the third leg of that stool, and the one most directly tied to business outcomes: a high override rate in T2 is leaking money on bad refunds; a high override rate in T3 is leaking trust on bad cancellations.

The HITL metrics post covers the broader measurement framework. Override rate per tier is the metric that ties the framework to the specific failure modes of customer-facing agents.

The five failure modes we see most often

Across the customer-facing HITL deployments we have worked on or reviewed, the same five failure modes keep showing up:

Flat thresholds. A single dollar amount or a single confidence score gating all HITL decisions. This works until a fraud ring tests it for a week and finds the soft spot.
Agent-assigned tiers. Letting the model decide what counts as T3. The model is the system under review; it cannot also be the system that decides whether it is being reviewed.
Single-reviewer T4 approvals. A T4 action with only one human in the loop is a T4 in name only. With the EU AI Act's August 2026 enforcement deadline for high-risk systems, dual control is the most defensible posture for catastrophic actions.
No override audit trail. Overrides happen in Slack, in email, in a Notion page, or in someone's head. Six months later, when the regulator asks "who approved this?", the answer is "we think it was Marcus." That answer does not survive an audit.
Latency-blind tier design. Treating T3 the same as T1 from a latency-budget perspective, then wondering why customers abandon the chat. The latency budget is part of the tier definition, not an afterthought.

Each of these failure modes is fixable with configuration, not code — which is the core argument the configuration-not-code post makes for HITL systems in general, and which applies double for customer-facing systems where the tier model has to evolve quickly as fraud patterns shift.

A practical checklist for shipping customer-facing HITL

Before you turn on a customer-facing agent, walk through this list:

Tier definitions are in a config file, versioned and reviewable. Not in the agent prompt. Not in a hardcoded if amount > 500: block.
Each tier has a named latency budget and a named override rate target. T2: 0.5% override rate, 24-hour audit window. T3: 5% override rate, 60-second median review. T4: dual control, 5-minute median review, full audit.
Override actions produce a structured event with the five fields above, written to an append-only log that the agent cannot write to.
Reviewer calibration is reviewed monthly. Override rate, median latency, and override-after-override rate (the cases where a senior reviewer also got it wrong) are reported at the team level.
The customer never sees a raw approval pause. "I'm checking this for you" beats "an agent will review your request" every time.
There is a tested rollback path for every tier. T2 rollback is automated. T3 rollback is human-initiated with an audit trail. T4 rollback requires a second human.

If you can check all six boxes, you have a customer-facing HITL system that will hold up under the next fraud wave, the next regulatory inquiry, and the next CFO question about a five-figure mistake.
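The second checklist item, named budgets and targets per tier, might look like this once it is in configuration. The numbers are the ones quoted in the checklist; the TierTarget structure and the breaches function are a hypothetical sketch of how a monitoring job could compare measured values against them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierTarget:
    max_override_rate: Optional[float]    # None means no target stated
    max_median_review_s: Optional[float]  # None means no sync review budget
    sync_approvals: int                   # reviewers who must sign before acting
    audit_window_h: Optional[int] = None  # async rollback window, if any


TARGETS = {
    "T2": TierTarget(0.005, None, sync_approvals=0, audit_window_h=24),
    "T3": TierTarget(0.05, 60.0, sync_approvals=1),
    "T4": TierTarget(None, 300.0, sync_approvals=2),
}


def breaches(tier: str, override_rate: float, median_review_s: float) -> list:
    """Return the names of the targets this tier is currently missing."""
    target = TARGETS[tier]
    found = []
    if target.max_override_rate is not None and override_rate > target.max_override_rate:
        found.append("override_rate")
    if target.max_median_review_s is not None and median_review_s > target.max_median_review_s:
        found.append("median_review_s")
    return found
```

Keeping the targets next to the tier definitions means a change to a budget is reviewed in the same diff as the tier it applies to.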

How Facio and Placet fit

Facio is the agent runtime that enforces the tier model. When the agent's tool call crosses a T3 boundary, Facio pauses the execution, serializes the agent's state, and emits a structured approval request — including the agent's reasoning trace, the policy rule that triggered the escalation, and the reversible-vs-irreversible flag. Placet is the inbox where that approval request lands: in Slack, in Microsoft Teams, in email, in the Placet web UI, or in any of the other channels we ship adapters for. The reviewer responds, Facio resumes or aborts, and the entire decision — including any subsequent override — is written to an immutable audit log.
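The pause, review, and resume-or-abort cycle can be illustrated with a small state object. To be clear, this is not Facio's or Placet's actual API. ApprovalRequest, its fields, and the status strings are hypothetical names chosen to mirror the description above, with required_approvals set to 2 to model T4 dual control.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ApprovalRequest:
    action: str
    tier: str
    policy_rule: str      # the rule that triggered the escalation
    reversible: bool      # reversible-vs-irreversible flag
    reasoning_trace: str  # the agent's reasoning, shown to the reviewer
    required_approvals: int = 1
    approvers: set = field(default_factory=set)
    rejected_by: Optional[str] = None

    def status(self) -> str:
        if self.rejected_by is not None:
            return "aborted"
        if len(self.approvers) >= self.required_approvals:
            return "resumed"
        return "pending"

    def decide(self, reviewer: str, approve: bool) -> str:
        """Record one reviewer's decision and return the new status."""
        if self.status() != "pending":
            raise RuntimeError("request is already closed")
        if approve:
            # A set, so the same reviewer approving twice still counts once.
            self.approvers.add(reviewer)
        else:
            self.rejected_by = reviewer
        return self.status()
```

With required_approvals=2, a second approval from the same reviewer leaves the request pending. Only a distinct second reviewer moves it to resumed, and a single rejection aborts it.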

That two-half architecture is the same one we described in the HITL Needs Two Halves post. For customer-facing agents, it is not a nice-to-have. It is the difference between an agent you can ship to production and an agent that gets shut down after the first $50,000 incident.

The architecture is also why the When NOT to Use HITL post matters. Not every customer-facing action needs a human. The tier model is the mechanism that lets you decide — explicitly, configurably, and auditably — which ones do.
