The failures hiding inside your AI agent containment numbers (and how to fix them)


Happy paths are enough for demos. Once an agent begins handling real customer interactions, especially in workflows involving payments, health information, account updates, or other business-critical actions, the standard changes. The question shifts from whether the agent can complete a task to whether it can do so safely, predictably, and under control.
A contact center director at a B2B credit union described this gap to me last year. His team had an agent handling hundreds of calls a day. The board's question was simple: how do we know when this agent makes a mistake in critical payment flows, and what are we doing to prevent those mistakes in the first place?
They did not have an operating model with defined boundaries, runtime controls, and a way to inspect what the agent actually did in each conversation. Those capabilities are mission critical, and they don't show up when you only measure containment rates and partial CSAT signals.
Containment tells you whether the customer reached a human. It does not tell you whether the agent made the right decision, followed policy, used the right system correctly, or actually resolved the customer’s problem. A call can be contained while still failing the customer completely.
That is the shift teams have to make once an agent goes live: from asking whether the agent can perform, to asking whether the system around it is ready for production.
The framework is straightforward: Configure, Control, and Evaluate.
Configure
Configure the agent before it goes live: This is where teams define how the agent should behave in production. That includes guardrailing inputs and outputs to reduce malicious or unsafe behavior, and configuring deterministic paths for critical workflows where the business cannot rely on LLM judgment. Deterministic paths apply to high-risk actions — payments, account updates, compliance-sensitive decisions. The goal is to put clear boundaries around the moments that carry the most customer, operational, or compliance risk.
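To make this concrete, here is a minimal Python sketch of what such a configuration could look like. All of the names here (DETERMINISTIC_WORKFLOWS, guard_input, route) are hypothetical shorthand for the pattern, not a Level AI or vendor API.

```python
# Illustrative configuration sketch: deterministic routing for high-risk
# workflows, plus a guardrail on inputs. Hypothetical names throughout.
import re

# Workflows the business cannot leave to LLM judgment: each maps to a
# fixed, scripted handler instead of free-form generation.
DETERMINISTIC_WORKFLOWS = {
    "make_payment": "handle_payment_script",
    "update_account": "handle_account_update_script",
    "compliance_disclosure": "handle_disclosure_script",
}

BLOCKED_INPUT_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),  # prompt injection
]

def guard_input(user_message: str) -> bool:
    """Reject inputs that match known malicious or unsafe patterns."""
    return not any(p.search(user_message) for p in BLOCKED_INPUT_PATTERNS)

def route(intent: str, user_message: str) -> str:
    """Send high-risk intents down deterministic paths; everything else
    goes to the LLM, with guardrails applied either way."""
    if not guard_input(user_message):
        return "refuse_and_escalate"
    return DETERMINISTIC_WORKFLOWS.get(intent, "llm_freeform_handler")
```

The point of the route function is that high-risk intents never reach free-form generation: they hit a fixed script, and the guardrail runs before any path is chosen.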
Control
Control the agent while it is live: The Agent Harness provides the runtime control layer that governs which tools are exposed, when they are exposed, and what authorization is required before they can be used. Sensitive capabilities should be disclosed progressively, not made available all at once. The same applies to information access: the agent should respect RBAC boundaries when retrieving documents, knowledge, or other underlying data sources. Control is what prevents unauthorized actions, premature tool use, or access to information the user is not entitled to see.
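Here is a sketch of the same idea in Python, under the assumption that a harness mediates every tool call. ToolPolicy and Harness are illustrative names, not a real harness API.

```python
# Hypothetical runtime control layer: tools are disclosed progressively
# and gated by authorization, and retrieval is filtered by role.
from dataclasses import dataclass, field

@dataclass
class ToolPolicy:
    name: str
    requires_auth: bool = False          # caller must be verified first
    unlock_after: str | None = None      # disclosed only after this step completes

@dataclass
class Harness:
    policies: dict[str, ToolPolicy]
    completed_steps: set[str] = field(default_factory=set)
    user_verified: bool = False
    user_roles: set[str] = field(default_factory=set)

    def exposed_tools(self) -> list[str]:
        """Only tools whose preconditions are met are visible to the agent."""
        return [
            p.name for p in self.policies.values()
            if p.unlock_after is None or p.unlock_after in self.completed_steps
        ]

    def authorize(self, tool: str) -> bool:
        """Final gate before execution, independent of what the model asked for."""
        policy = self.policies[tool]
        return self.user_verified or not policy.requires_auth

    def can_retrieve(self, doc_required_role: str) -> bool:
        """RBAC check on retrieval: the agent sees only documents
        this user is entitled to see."""
        return doc_required_role in self.user_roles
```

The key property is that the gate runs outside the model. Even if the model requests a tool, the harness decides whether it is visible, authorized, and within the user's entitlements.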
Evaluate
Evaluate the agent after deployment: Containment rate and escalation rate miss whether the agent followed the intended path, used the right tool, or resolved the issue correctly.
A quality mechanism for virtual agents lets business and operational teams inspect conversations at scale, across every workflow, agent, and failure type, so they can surface failures early, diagnose patterns, and improve the agent after rollout. Evaluation is the feedback loop that turns production behavior into diagnosable, improvable data.
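A sketch of what that evaluation pass might look like over the full conversation population; the rubric and field names are illustrative, and in practice each check could be a rule, a classifier, or an LLM judge.

```python
# Evaluate every conversation, not a sample, then aggregate failures
# by workflow and check to surface patterns. Illustrative names only.
from collections import Counter

RUBRIC = ["followed_intended_path", "used_correct_tool", "issue_resolved"]

def evaluate(conversation: dict) -> dict[str, bool]:
    """Score one conversation against the rubric. Each check might be
    implemented as a rule, a classifier, or an LLM judge."""
    return {check: conversation.get(check, False) for check in RUBRIC}

def failure_report(conversations: list[dict]) -> Counter:
    """Count failures per (workflow, check) pair across the population."""
    failures: Counter = Counter()
    for convo in conversations:
        for check, passed in evaluate(convo).items():
            if not passed:
                failures[(convo.get("workflow", "unknown"), check)] += 1
    return failures
```

Aggregating failures by workflow and check is what turns raw transcripts into patterns like "the payment flow fails the tool-selection check far more often than any other workflow."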
Early deployments tend to overinvest in agent intelligence and underinvest in the operating model around it. In production, the surrounding system matters as much as the agent itself.
This is not a new idea in the contact center. Human-agent operations have long relied on structure, oversight, and quality review because outcome metrics alone are never enough. Virtual agents need the same discipline, adapted for autonomous systems.
The solution is a layered operating model (a short sketch after the list shows how the layers fit together):
- Before the conversation, define the agent's boundaries — which workflows require deterministic paths, which actions require authorization, and what the agent is not permitted to do.
- During the conversation, govern sensitive actions and tool access through a runtime control layer.
- After the conversation, inspect outcomes across the full interaction population, not a sample.
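Taken together, the three layers compose around each interaction roughly like this. The sketch reuses the hypothetical route() and Harness helpers from above and assumes each handler is registered with the harness as a gated capability.

```python
def run_interaction(intent: str, message: str, harness: "Harness",
                    audit_log: list) -> None:
    # Before the conversation: boundaries pick the path
    # (deterministic script vs. free-form LLM handling).
    handler = route(intent, message)

    # During the conversation: the harness gates the capability at
    # runtime, regardless of what the router or model asked for.
    if handler in harness.exposed_tools() and harness.authorize(handler):
        status = "executed"
    else:
        status = "escalated_to_human"

    # After the conversation: record every outcome so evaluation
    # covers the full interaction population, not a sample.
    audit_log.append({"workflow": intent, "handler": handler, "status": status})
```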
That is the difference between an agent that demos well and one that is ready for production.
On May 7, I will walk through all three layers live in the Level AI platform, including what a failed conversation looks like when it is caught. Register below if you are running an AI support agent in production.
Register for the Live Walkthrough here!