Human-in-the-loop · Jun 17, 2026

HITL for Code Generation Agents: Why 'Looks Good' Approvals Are Creating Production Incidents

Code generation agents are the fastest-growing HITL use case — and the worst-implemented. Reviewers approve diffs they don't fully understand, miss subtle security flaws, and let prompt injection land in production. The 10× velocity claim isn't real if 30% of approvals are rubber-stamps.

HITL · Code Generation · Agent Architecture · Software Engineering · Security

There's a peculiar pattern in code review: the human approves, the code ships, and three weeks later the security team finds a vulnerability that should have been caught at the PR stage. The vulnerability was in plain sight. The reviewer clicked "Approve" anyway. The audit trail says the human reviewed it.

This is the code generation HITL problem. AI coding agents produce diffs at a velocity no human can keep up with. The human reviewer is asked to evaluate 200 lines of generated code in 30 seconds. The reviewer approves — not because the code is correct, but because the alternative (carefully reading 200 lines) is not what their workday can absorb.

The result: AI-generated code ships to production with security flaws, performance bugs, and logic errors that the human reviewer was supposed to catch — and didn't. The Fastly 2025 survey found that nearly 30% of senior engineers said fixing AI-generated output consumed most of the time AI saved. Apiiro's research showed developers using AI introduced roughly 10× more security vulnerabilities than those who didn't. The "AI acceleration" narrative doesn't survive contact with the cost of fixing what AI got wrong.

HITL for code generation is a special case: harder than customer-facing actions, harder than data lookups, harder than document generation. The artifact under review is executable logic whose behavior can differ in subtle ways from what it appears to do, and the surface area for error is enormous.

Why Code Review Is the Worst Case for HITL

The Volume Problem

A single human can thoroughly review maybe 200–400 lines of code per hour. An AI coding agent can produce 2,000+ lines per hour. The math is unforgiving: the human either reviews 10–20% of what the agent produces (and never reads the rest) or the agent slows down to human review speed (and the velocity claim evaporates).
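The arithmetic can be made explicit. This is a minimal sketch using the figures quoted above (200–400 lines reviewed per hour, 2,000 lines generated per hour); the function name is illustrative.

```python
def review_coverage(review_rate_lph: float, agent_rate_lph: float) -> float:
    """Fraction of agent output one reviewer can thoroughly read per hour."""
    if agent_rate_lph <= 0:
        raise ValueError("agent rate must be positive")
    return min(1.0, review_rate_lph / agent_rate_lph)


low = review_coverage(200, 2000)
high = review_coverage(400, 2000)
print(f"coverage: {low:.0%} to {high:.0%}")  # prints "coverage: 10% to 20%"
```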

Most teams pick the first option and don't acknowledge it. The PR queue fills with agent-generated diffs. The human reviewers "approve" the queue. Nobody has actually read the code. The HITL gate is present. The HITL gate is non-functional.

The Asymmetry Problem

Customer-facing actions have natural reversibility: a wrong refund can be reversed, a wrong account change can be undone, a wrong cancellation can be uncancelled. Code changes have a different asymmetry: once merged and deployed, a security flaw is exploitable until discovered. The time between "shipped" and "exploited" is the time between approval and incident.

This asymmetry is invisible to the reviewer. They see a diff. They don't see the production environment. They don't see the attack surface. They see syntax-highlighted lines and a green checkmark from the CI pipeline.

The Trust Transfer Problem

A human reviews a junior engineer's PR with calibrated skepticism: "what did they miss?" The same human reviews an AI agent's PR with a different default: "the agent probably got it right, this is just a sanity check." The framing difference is subtle but consistent.

This is the automation bias in HITL: as the perceived source becomes more "intelligent" (a sophisticated AI agent vs. a junior colleague), reviewers' critical evaluation decreases. The automation bias is documented across domains — aviation, medical decision support, and now code review.

The Indistinguishable Error Problem

AI-generated code often looks correct. The syntax is valid. The tests pass. The naming is consistent. The structure mirrors the surrounding code. The reviewer reads it. The reviewer doesn't see anything obviously wrong. They approve.

The vulnerability — a SQL injection vector that doesn't trigger with the test fixtures, a race condition that manifests only under load, a credential exposure that occurs only in the production deployment path — is hidden in the diff but invisible to a reviewer reading at PR-queue velocity.
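To illustrate the SQL injection case, here is a small sketch using Python's built-in sqlite3 module. The table, function names, and payload are invented for the example. Both lookups return the same result for a fixture-style input, so a test suite built on benign fixtures passes either way; only a hostile input separates them.

```python
import sqlite3


def find_user_unsafe(conn, name):
    # Vulnerable: user input is interpolated into the SQL text.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Parameterized: the driver treats the input strictly as a value.
    query = "SELECT id, name FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

# Fixture-style input: both versions agree, so the tests are green.
assert find_user_unsafe(conn, "alice") == find_user_safe(conn, "alice")

# Hostile input: the unsafe version returns every row.
payload = "x' OR '1'='1"
print(len(find_user_unsafe(conn, payload)))  # prints 2
print(len(find_user_safe(conn, payload)))    # prints 0
```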

The Three Categories of Code-Generation HITL

Not all agent-generated code is equal. The HITL design should treat the categories differently.

Category 1: The Format-Safe Change

CSS adjustments, type definition updates, renaming, comment additions, test scaffolding, refactoring that doesn't change behavior. The risk surface is small. The output is mechanically verifiable (linters, type checkers, formatters).

HITL design: Automated validation is sufficient. Linter, type checker, formatter run automatically. The human reviews only when the automated checks fail. A green CI means the change is safe to merge — no human review required.

Category 2: The Behavior-Changing Change

New business logic, API endpoint additions, query changes, validation rule changes, integration code. The risk surface is moderate. The output is not mechanically verifiable in all cases.

HITL design: Synchronous human review. But with structured context: the agent's stated intent, the tests it ran, the diff highlighted to emphasize changes, the test plan and acceptance criteria. The reviewer's job is to compare the agent's intent with the actual code, not to discover the intent from scratch.

Category 3: The High-Stakes Change

Authentication, authorization, cryptography, data access, payment processing, credential handling, anything touching production secrets or PII. The risk surface is high. A subtle error is a CVE.

HITL design: Two-reviewer pattern. The first reviewer is a domain expert who can read the code and reason about the security implications. The second reviewer is independent — a different person, in a different review session, with the first reviewer's verdict hidden. Disagreement triggers escalation to a security specialist.
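One way to sketch the blind second review is shown below. The class and outcome labels are hypothetical; the point is only that neither reviewer can see the other's verdict until both are recorded, and that disagreement escalates rather than resolving by default.

```python
from dataclasses import dataclass, field


@dataclass
class BlindReview:
    """Collects two independent verdicts (True = approve)."""
    verdicts: dict[str, bool] = field(default_factory=dict)

    def submit(self, reviewer: str, approve: bool) -> None:
        if reviewer not in self.verdicts and len(self.verdicts) >= 2:
            raise ValueError("only two reviewers are allowed")
        self.verdicts[reviewer] = approve

    def visible_verdicts(self, viewer: str) -> dict[str, bool]:
        # Until both verdicts are in, a reviewer sees only their own.
        if len(self.verdicts) < 2:
            return {r: v for r, v in self.verdicts.items() if r == viewer}
        return dict(self.verdicts)

    def outcome(self) -> str:
        if len(self.verdicts) < 2:
            return "pending"
        values = set(self.verdicts.values())
        if values == {True}:
            return "approved"
        if values == {False}:
            return "rejected"
        return "escalate_to_security"
```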

Requiring two reviewers doesn't mean double the work for every change. It means double the work for the 5–10% of changes that touch security-critical surfaces. For everything else, the single-reviewer path stays.
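The three categories can be approximated by a classifier over the files a change touches. This is a simplified sketch: the path patterns are hypothetical, and a path check alone cannot prove that a refactor leaves behavior unchanged, so a real system would combine it with other signals.

```python
from fnmatch import fnmatch

# Hypothetical patterns; each repository would define its own.
HIGH_STAKES = ["*auth*", "*crypto*", "*payment*", "*secrets*"]
FORMAT_SAFE = ["*.css", "*.md", "*.d.ts"]


def review_tier(paths: list[str]) -> str:
    """Pick the strictest tier required by any touched file."""
    if any(fnmatch(p, pat) for p in paths for pat in HIGH_STAKES):
        return "two_reviewers"
    if paths and all(any(fnmatch(p, pat) for pat in FORMAT_SAFE) for p in paths):
        return "automated_only"
    # Anything unrecognized defaults to a human reviewer.
    return "single_reviewer"
```

The broad patterns err toward over-classification (a file named authors.md would land in the two-reviewer tier), which is the safer direction for a gate like this.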

The Four Failure Modes of Code-Generation HITL

1. The "Looks Good" Approval

The reviewer reads the diff, sees nothing obviously wrong, types "LGTM" or clicks the approve button, and moves on. The 200 lines were skimmed. The vulnerability that the careful reader would have caught in 8 minutes is shipped to production.

Architectural fix: Require reviewers to leave structured comments, not just approval. The review interface should require:

What was the agent's stated intent? (often missing)
What tests confirm the behavior? (the agent's tests, or your own?)
What edge cases did you consider? (at least one)
What could go wrong? (at least one concern, even if dismissed)

This adds 90 seconds of friction per review. Reviews that take 30 seconds are not reviews.
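The four required answers can be enforced mechanically before the approve button is enabled. The field names below are assumptions chosen to mirror the questions above, not an existing review tool's schema.

```python
from dataclasses import dataclass, field


@dataclass
class StructuredReview:
    stated_intent: str
    confirming_tests: str
    edge_cases: list[str] = field(default_factory=list)
    concerns: list[str] = field(default_factory=list)


def missing_fields(review: StructuredReview) -> list[str]:
    """Return the names of required answers that are empty."""
    missing = []
    if not review.stated_intent.strip():
        missing.append("stated_intent")
    if not review.confirming_tests.strip():
        missing.append("confirming_tests")
    if not any(e.strip() for e in review.edge_cases):
        missing.append("edge_cases")
    if not any(c.strip() for c in review.concerns):
        missing.append("concerns")
    return missing


def can_approve(review: StructuredReview) -> bool:
    return not missing_fields(review)
```

A bare "LGTM" maps to a review with every field empty, so it cannot be submitted as an approval.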

2. The Diff-Length Fatigue

The reviewer sees a 1,800-line diff. They scroll to the bottom. They approve. The diff is too long to read in a single session, so the reviewer reads none of it.

Architectural fix: Agents must produce small diffs. A code generation agent that produces 1,800-line diffs is a code generation agent that should be reconfigured to produce many small diffs. Break the work into reviewable units. A 200-line diff is reviewable. A 1,800-line diff is not.
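A size budget is easy to check before a diff reaches the queue. The sketch below counts added and removed lines in a unified diff and compares the total with the 200-line figure used above. It is deliberately simple: a removed line whose own content begins with two dashes would be mistaken for a file header, so treat the count as approximate.

```python
MAX_REVIEWABLE_LINES = 200  # threshold taken from the text; tune per team


def changed_lines(unified_diff: str) -> int:
    """Count added and removed lines, skipping file headers."""
    count = 0
    for line in unified_diff.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file header lines
        if line.startswith(("+", "-")):
            count += 1
    return count


def is_reviewable(unified_diff: str, limit: int = MAX_REVIEWABLE_LINES) -> bool:
    return changed_lines(unified_diff) <= limit
```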

3. The False-Test Approval

The reviewer sees green CI. The tests pass. They approve. The tests were written by the agent, testing the agent's code. The tests verify what the agent did, not what the system should do. The CI is green, but the system is wrong.

Architectural fix: For behavior-changing code, the test plan must be specified by the human before the agent writes the code — or independently verified after. The agent's own tests are evidence of self-consistency, not of correctness against requirements. The HITL interface should distinguish: "this change has human-specified acceptance criteria" vs. "this change has agent-specified tests."
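The distinction can be carried as a label on the change. In this sketch, the mapping from test name to author is an assumed input (for example, derived from commit authorship), and the rule that one human-written test is enough to earn the stronger label is a simplification a stricter policy could tighten.

```python
from enum import Enum


class CriteriaSource(Enum):
    HUMAN_SPECIFIED = "this change has human-specified acceptance criteria"
    AGENT_SPECIFIED = "this change has agent-specified tests"


def evidence_label(test_authors: dict[str, str]) -> CriteriaSource:
    """test_authors maps a test name to "human" or "agent"."""
    if any(author == "human" for author in test_authors.values()):
        return CriteriaSource.HUMAN_SPECIFIED
    # No tests, or only agent-written tests: self-consistency at best.
    return CriteriaSource.AGENT_SPECIFIED
```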

4. The Prompt-Injection Vulnerability

The agent retrieved a document — a Confluence page, a Stack Overflow answer, an open-source file — that contained a prompt injection payload. The payload instructed the agent to add a backdoor, exfiltrate credentials, or weaken authentication. The agent complied. The diff looks normal. The reviewer approves.

This isn't theoretical. Supply-chain attacks through coding agents are documented: agents trained on or given access to compromised documentation can produce compromised code. The HITL review does not catch this because the malicious code is structured to look like the rest of the diff.

Architectural fix: The review interface should flag any change that touches security-critical code AND was generated with external context. The combination of "diff touches auth" + "agent had internet access during generation" = elevated review. The reviewer is told what to look for.
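The flag itself is a conjunction of two facts about the change. The sketch below assumes both facts are recorded at generation time; the type and the notice text are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GenerationContext:
    touches_security_critical: bool  # e.g. the diff modifies auth code
    used_external_context: bool      # e.g. web pages, wikis, third-party files


def needs_elevated_review(ctx: GenerationContext) -> bool:
    return ctx.touches_security_critical and ctx.used_external_context


def reviewer_notice(ctx: GenerationContext) -> Optional[str]:
    """Tell the reviewer what to look for, or return None if not flagged."""
    if not needs_elevated_review(ctx):
        return None
    return (
        "Elevated review: security-critical code generated with external "
        "context. Look for backdoors, credential exfiltration, and "
        "weakened authentication."
    )
```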

The Architecture That Works

A code-generation HITL system that actually works has three components:

Component 1: Pre-Review Automated Validation

Before the human sees the diff, automated checks run:

Linter, formatter, type checker
Static security analysis (Semgrep, CodeQL, or equivalent)
Dependency vulnerability scanning
Test execution and coverage analysis

Failures are blockers. The human review is not a substitute for the automated checks. If the linter fails, the agent's code is wrong before the human even sees it.
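A blocking gate can be sketched as a set of named checks that must all pass before a diff is queued for a human. Each check here is a placeholder callable; in practice it would wrap the team's linter, static analyzer, dependency scanner, or test runner, none of which are invoked in this example.

```python
from typing import Callable

Check = Callable[[], bool]


def run_gate(checks: dict[str, Check]) -> list[str]:
    """Run every check and return the names of those that failed."""
    failures = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failure
        if not ok:
            failures.append(name)
    return failures


def ready_for_human_review(checks: dict[str, Check]) -> bool:
    # Any failure blocks the diff from reaching a reviewer.
    return not run_gate(checks)
```

Running every check instead of stopping at the first failure gives the agent the full list of problems to fix in one pass.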

Component 2: Structured Human Review

The review interface shows:

The agent's stated intent (in one sentence, at the top)
The diff, with security-critical changes highlighted
The tests that were run, and their results
The specific acceptance criteria (if human-specified)
The risk indicators (files touched, lines changed, security-critical surfaces)

The reviewer's job is constrained: they evaluate the agent's intent against the code, not the code in isolation. The friction of structured review (90 seconds minimum per change) prevents rubber-stamp approvals.

Component 3: Post-Merge Monitoring

The merge happens. The HITL system tracks the change in production. If the change causes an incident (a security alert, a performance regression, a customer-reported bug), the audit trail traces the failure back to the specific approval, the specific reviewer, the specific decision.

This is the feedback loop. A reviewer who approves changes that cause incidents will, over time, have a higher incident-per-approval rate. The system surfaces this. The reviewer learns — or the reviewer is rotated to less critical work.
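The per-reviewer rate is a simple aggregation over the audit trail. The input shapes below (pairs of change ID and reviewer, plus a set of change IDs linked to incidents) are assumptions about what such a trail would expose.

```python
from collections import Counter


def incident_rate_by_reviewer(
    approvals: list[tuple[str, str]],
    incident_change_ids: set[str],
) -> dict[str, float]:
    """approvals holds (change_id, reviewer) pairs."""
    approved = Counter()
    incidents = Counter()
    for change_id, reviewer in approvals:
        approved[reviewer] += 1
        if change_id in incident_change_ids:
            incidents[reviewer] += 1
    return {reviewer: incidents[reviewer] / n for reviewer, n in approved.items()}


approvals = [("c1", "ana"), ("c2", "ana"), ("c3", "ben")]
print(incident_rate_by_reviewer(approvals, {"c2"}))  # {'ana': 0.5, 'ben': 0.0}
```

Rates computed over a handful of approvals are noisy, so a minimum sample size is worth applying before acting on them.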

The Metrics That Matter for Code-Generation HITL

Override Rate

What percentage of agent-generated diffs are rejected or significantly modified by the human reviewer? A high override rate means the agent is producing wrong code. A low override rate may mean the agent is good, or it may mean the reviewers are rubber-stamping. The signal is ambiguous without more data.

Incident Rate Per Approval

What percentage of approved changes cause an incident in production within 30 days? This is the definitive metric. A low incident rate is the actual evidence that HITL is working.
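A sketch of the 30-day calculation follows. It assumes the window is measured from the approval timestamp and that each incident has already been attributed to a change ID; both are assumptions, since the attribution step is the hard part in practice.

```python
from datetime import datetime, timedelta


def incident_rate(
    approved_at: dict[str, datetime],
    incidents: list[tuple[str, datetime]],
    window_days: int = 30,
) -> float:
    """Share of approved changes with an incident inside the window."""
    if not approved_at:
        return 0.0
    window = timedelta(days=window_days)
    hit = {
        change_id
        for change_id, when in incidents
        if change_id in approved_at
        and timedelta(0) <= when - approved_at[change_id] <= window
    }
    return len(hit) / len(approved_at)
```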

Review Velocity vs. Diff Size

How long does the human spend reviewing diffs of various sizes? Reviewers spending 5 minutes on a 1,800-line diff are probably not reviewing. Reviewers spending 2 minutes on a 200-line diff might be reviewing carefully. Track the relationship.
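Tracking the relationship amounts to computing a reading rate per review and flagging outliers. The threshold is left as a required parameter here because the right value is an assumption each team has to calibrate against its own review data.

```python
def lines_per_minute(lines_changed: int, review_seconds: float) -> float:
    if review_seconds <= 0:
        return float("inf")
    return lines_changed * 60 / review_seconds


def looks_like_skim(lines_changed: int, review_seconds: float, max_lpm: float) -> bool:
    """Flag reviews whose reading rate exceeds the calibrated maximum."""
    return lines_per_minute(lines_changed, review_seconds) > max_lpm


# The two examples from the text, with a hypothetical 150 lines/minute cap.
print(lines_per_minute(1800, 300))      # prints 360.0
print(lines_per_minute(200, 120))       # prints 100.0
print(looks_like_skim(1800, 300, 150))  # prints True
```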

Pre-Merge Defect Catch Rate

What percentage of pre-merge defects are caught by automated validation vs. by human review? If automated validation catches 90% and human review catches 10%, the human is contributing marginally. The HITL design needs to focus the human attention on what automated checks miss.
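The split is a ratio of two counts. This minimal sketch assumes each pre-merge defect is attributed to whichever stage caught it first.

```python
def catch_share(caught_by_automation: int, caught_by_human: int) -> dict[str, float]:
    """Share of pre-merge defects caught by each stage."""
    total = caught_by_automation + caught_by_human
    if total == 0:
        return {"automation": 0.0, "human": 0.0}
    return {
        "automation": caught_by_automation / total,
        "human": caught_by_human / total,
    }


print(catch_share(90, 10))  # {'automation': 0.9, 'human': 0.1}
```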

Where Facio Fits

Facio's policy engine enforces the risk-tiered review pattern for code generation. The same submit_code_change action can route to: automated validation only (for format-safe changes), single-reviewer synchronous review (for behavior changes), or two-reviewer pattern (for high-stakes changes). The tier is determined by the action manifest — files touched, security classification, deployment environment — not by the agent's own assessment.

Placet.io's review interface is built for code review, not generic approvals. The diff is the primary surface. The agent's intent, the test plan, the security flags are all visible at the top. The approve path requires structured review (intent acknowledged, edge cases considered, risk identified). The reviewer's job is constrained to evaluating the code against the agent's stated intent and the human's acceptance criteria.

The audit trail captures the full chain — which agent produced the change, what context the agent had access to, which automated checks passed, which human reviewed it, what they said. If the change causes an incident three weeks later, the trail traces back to the specific approval, the specific reviewer, the specific decision.

Key Takeaways

Code generation is the worst case for HITL: high volume, asymmetric risk, automation bias, and errors that look correct
Three risk tiers require different HITL designs: format-safe (automated only), behavior-changing (single reviewer with structured context), high-stakes (two reviewers, security-focused)
Four failure modes to design against: "looks good" approvals, diff-length fatigue, false-test approvals, and prompt-injection vulnerabilities
Require structured review, not just approval: intent acknowledged, edge cases considered, risk identified, minimum 90 seconds per review
Pre-review automated validation is non-negotiable: linting, static analysis, dependency scanning, test execution. Human review is not a substitute
Track incident rate per approval: that is the definitive metric. Override rate alone is ambiguous
Facio's policy engine enforces the risk-tiered pattern: the action manifest determines the review tier, not the agent's assessment

Sources: The code generation HITL challenges draw on documented research from Apiiro (4× velocity, 10× vulnerabilities), Fastly's 2025 senior developer survey, the ClawHavoc supply-chain attack analysis from Cisco, and production patterns from code review platforms (GitHub, Phabricator). The automation bias findings reflect research from aviation and medical decision support literature applied to AI-assisted code review.
