Back to blog

Security · Jun 8, 2026

Your Monitoring Says Green, Your Agent Is Wrong: The Observability Gap Killing Production AI

Traditional monitoring can keep 99.99% availability while the agent's decisions degrade quality across the workflow. Decision integrity is the new observability surface — and OpenTelemetry's GenAI semantic conventions plus the new MCP tracing layer are finally making it tractable. Here's how to build for it.

Agent ObservabilityDecision TracingOpenTelemetry GenAIMCP TracingProduction AI

Your Monitoring Says Green, Your Agent Is Wrong: The Observability Gap Killing Production AI

Your AI agent is up. Latency is normal. No errors. No exceptions. Token usage is within budget. The health check is green. And the agent has just approved a loan it should have rejected, recommended a drug interaction it should have flagged, or pushed code that broke production.

This is the failure mode that traditional observability cannot detect — and it is the failure mode that defines production AI in 2026.

LogicMonitor's framing captures the structural shift: in deterministic systems, the system fails loudly — crash, timeout, 500 error, resource exhaustion. In agentic systems, the system stays green while outcomes degrade. The agent returns something. It is well-formed. It may even be plausible. The problem is that it is wrong, or it took an expensive detour to get there, or it called a tool it never needed. None of that trips a 500.

The operational risk has shifted from system failure to decision quality. And the observability layer that detects decision quality — capturing intent, reasoning, tool selection, and propagation — is fundamentally different from the observability layer that detects system health.

The New Failure Surface: Decision Integrity

The traditional three-pillar observability model — metrics, logs, traces — was designed for deterministic services. You watch latency and error rates, you read logs when something throws, you trace requests across services. An agent breaks that model in two ways at once.

The same input does not always produce the same behavior. Temperature, retrieval results, and tool availability all shift the path the agent takes. The same prompt can trigger a different sequence of tool calls on two consecutive runs. A single "happy path" trace is insufficient — you need to observe the distribution of behaviors, not one example.

Failure rarely surfaces as an error. The agent returns something. It is well-formed. It may even be plausible. The problem is that it is wrong, or it took an expensive detour to get there, or it called a tool it never needed. This is why step-level tracing — recording each reasoning step, tool call, and model response as a nested span — is the foundational requirement, and why a health check that only reports "up" is close to useless for an agent.

The practical consequence is that observability for agents has to capture intent and process, not just inputs and outputs. The trace must include the reasoning path, the tools considered, the tools actually invoked, the arguments passed, the responses returned, the tokens spent at each step, and the latency of each hop — all stitched into one hierarchical trace you can replay. Runtime tracing of this kind is the natural complement to offline evaluation frameworks, which catch regressions before deployment; tracing catches what production throws at you after.

The OpenTelemetry GenAI Standard: Vendor-Neutral at Last

The most important development in agent observability in 2026 is not a product — it is a specification. The OpenTelemetry GenAI semantic conventions define a common vocabulary for AI telemetry: a standard set of gen_ai.* span and metric attributes that any instrumentation library can emit and any backend can ingest. Adopt them and you decouple your instrumentation from your vendor — you can switch observability platforms without re-instrumenting your agents.

As of v1.41 (May 2026), the specification spans six layers: client (model-call) spans, agent and workflow spans, MCP conventions, semantic events, metrics, and provider-specific attributes. The status is honest: nearly every gen_ai.* attribute still carries a Development badge, meaning attribute names can change without a major version bump. The escape hatch is OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental, which enables dual-emission of both legacy and current attribute names so transitions do not silently break dashboards.

Two histogram metrics are effectively mandatory for any production deployment: gen_ai.client.operation.duration (latency in seconds) and gen_ai.client.token.usage (consumption in tokens, broken down by input and output). Export these two signals or you cannot reason about cost or speed. The major backends already speak the convention: Datadog natively supports OTel GenAI conventions from v1.37 onward (announced December 1, 2025), mapping gen_ai.* attributes to its LLM Observability schema automatically. For Python teams using OpenAI, instrumentation is essentially free: OpenAIInstrumentor().instrument() produces semconv-compliant spans with no manual span creation.

The Span Reference: What Each Agent Span Actually Means

The spec defines four span operation types specifically for agents. The subtle part is the span kind — it determines what the span actually measures and how it should be interpreted in a trace.

Span OperationSpan KindWhen It Fires
create_agentINTERNALAgent definition/instantiation. Carries the agent name, model, and configuration. Fires once when the agent object is created, not per request.
invoke_agentCLIENTRemote agent execution — OpenAI Assistants API, AWS Bedrock Agents. The agent runs on someone else's infrastructure; the span measures the round trip.
invoke_agentINTERNALLocal framework execution — LangChain, CrewAI, LangGraph agents running inside your process. Parents the model-call and tool spans for that agent.
invoke_workflowINTERNALMulti-agent orchestration. One invoke_workflow parents multiple invoke_agent children — the structure that makes handoffs legible in a single trace.
execute_toolINTERNALA single tool/function call. Captures the tool name, arguments, and result. MCP instrumentation enriches this span rather than creating a duplicate.
chat/inferenceCLIENTThe model call itself. Required attributes include model and token usage; input.messages and output.messages are opt-in, not captured by default.

In multi-agent systems, a single INTERNAL invoke_workflow span is the parent that wraps several invoke_agent children. That hierarchy is what lets you follow a task across agent handoffs in one trace. Without it, you see ten independent agent invocations with no record of the workflow that connected them.

The MCP Tracing Layer: The Tool Layer Is No Longer a Black Box

The tool layer used to be the black box in agent traces. You could see that a tool was called and what it returned, but the protocol mechanics underneath — which MCP method, which session, which protocol version — were invisible. OpenTelemetry closed that gap in v1.39, which added MCP semantic conventions with attributes including mcp.method.name, mcp.session.id, and mcp.protocol.version.

The clever design decision is how these attributes attach. When MCP instrumentation detects that an outer GenAI instrumentation already tracks the tool execution, it enriches the existing execute_tool span with the MCP attributes instead of creating a second, duplicate span. You get the protocol-level detail layered onto the tool span you already had — not a noisier trace. This matters for anyone building on Model Context Protocol: the visibility into the tool layer is now standardized, so an agent that calls ten MCP servers can be traced as cleanly as one that calls a single local function.

Privacy by Default: Why Most Telemetry Pipelines Are Dangerous

Content capture is opt-in, not automatic. The spec defines three modes: not recorded (the default), stored on span attributes, or kept in external storage with only a reference URL on the span. Message bodies — gen_ai.input.messages and gen_ai.output.messages — are not captured unless you explicitly opt in. For production systems handling PII, the external-storage-plus-reference mode is the recommended pattern: you keep the trace structure for debugging without writing customer data into your telemetry pipeline.

This is not a minor concern. The most common observability mistake in 2026 is the "I just turned on full content capture to debug an issue and forgot to turn it off" pattern. The traces get shipped to a SaaS vendor, the retention policy expires, and the security team discovers six months later that customer PII has been flowing through a vendor's S3 bucket the entire time. Opt-in capture exists for a reason. Use it deliberately, with a documented policy, and review capture settings as part of your security review process.

What Agent Observability Actually Answers

The three observability disciplines overlap in confusing ways. The cleanest framing is to ask what each one answers:

DisciplinePrimary FocusCore QuestionWhat It Monitors
LLM ObservabilityModel output qualityDid the interaction meet defined quality and safety thresholds?Prompts, token usage, latency, hallucinations, evaluation scores
AIOpsIT operations optimizationIs the infrastructure healthy and responding efficiently?Metrics, logs, alerts, anomaly detection, automated remediation
Agentic ObservabilityDecision integrity across workflowsDid the agent choose the right action, for the right reason, across systems?Multi-step reasoning, tool use, workflow coordination, downstream impact

In deterministic systems, observability captures system health and performance metrics. In agentic systems, it captures decisions, interactions, context, and outcomes. The MELT framework still applies — but each signal now reflects decision behavior, not just system performance:

  • Metrics reflect outcome quality and behavioral patterns — task success rate, retry frequency, decision latency, drift
  • Events reflect agent state transitions and tool invocations — plan revisions, escalations, execution triggers
  • Logs capture decision context — prompt inputs, intermediate evaluations, policy checks, guardrail conditions
  • Traces reconstruct multi-agent workflows — how context moved, which agents participated, how decisions propagated

Without correlation across these four signal types, telemetry appears as isolated noise — a metric spike here, an event there, logs that don't connect to traces.

The Decision Integrity Signals That Matter

Beyond raw trace capture, the production observability layer must surface specific signals that indicate decision integrity. Five categories are essential:

Decision quality over time. Track task success rate, plan revisions per task, retry frequency, and policy violations as time-series metrics. Drift in any of these signals degradation long before the user-visible failures show up in support tickets.

Tool use patterns. Monitor which tools the agent calls, how often, with what arguments, and what the success rate is. A sudden shift in tool selection patterns — particularly increased use of tools the agent has rarely used before — is often the earliest indicator of prompt injection or context poisoning.

Cross-agent propagation. In multi-agent workflows, trace how context and decisions flow from one agent to the next. A misaligned decision at agent 1 should not propagate unchallenged to agents 2 through 5; observability that captures the full chain is the only way to identify where the propagation broke.

Cost-per-decision anomalies. Token usage per task should be roughly stable. A sudden spike often indicates the agent has fallen into a reasoning loop, generated excessively long intermediate outputs, or invoked an expensive tool repeatedly. Cost anomalies are also one of the most reliable signals of model extraction attempts or adversarial probing.

Compliance-relevant events. Every action that affects regulated data, every tool invocation that crosses a system boundary, every policy decision — permitted or denied — must be captured in an audit-grade format. This is where agent observability merges with the compliance-grade audit trail architecture: the same trace that helps you debug also helps you satisfy an EU AI Act or SOC 2 audit.

Where Facio Fits

The observability architecture described here is not theoretical. Facio (the HITL-first agent runtime) implements the agent observability layer at the platform level:

  • Every agent action is captured as a structured span — reasoning steps, tool invocations, model calls, policy evaluations — in the format that maps directly to OpenTelemetry GenAI semantic conventions
  • The audit trail is tamper-evident by design: every span carries full provenance, every tool call is attributable to a specific agent and task, every authorization decision is recorded with the policy applied and the outcome
  • Cross-agent propagation is captured through invoke_workflow parents that link invoke_agent children into a single hierarchical trace — so multi-agent handoffs are legible in one view, not stitched together from disconnected spans

For human-in-the-loop decision integrity, Placet.io (the HITL inbox and messenger) provides the structured approval layer: when an agent reaches a decision boundary requiring human review, the request is delivered with full context, and the human decision is recorded as part of the trace.

The Bottom Line

The system can stay green while outcomes degrade. That sentence is the operational reality of production AI in 2026, and traditional observability cannot detect it. The shift from system-health monitoring to decision-integrity observability is the most important infrastructure change happening in production AI right now.

The OpenTelemetry GenAI semantic conventions and the new MCP tracing layer are finally making the decision layer tractable. The teams that adopt them now — vendor-neutral instrumentation, opt-in content capture, correlation across the MELT signal types, behavior-centric metrics — will be the ones that catch the next Clinejection, the next memory poisoning, the next lateral-movement attempt not because they built perfect defenses, but because they had the observability infrastructure to detect what was happening before it became catastrophic.

Uptime is no longer a sufficient signal of correctness. Decision integrity is.


Further reading: