The Chatbot That Stalled: A Conversation Design Case Study
The platform was not the problem. The experience design was never done.
The call came seven months after go-live. A mid-market property and casualty insurance carrier had launched a conversational AI assistant for its policyholder contact center. By platform metrics, the deployment looked acceptable. The chatbot was containing approximately 42 percent of tier-one contact volume without escalation. Integrations with the policy management system were stable. Response latency was within spec.
The problem was CSAT. It had been sitting at 61 percent since month two and showed no sign of movement. Internal leadership had assumed performance would improve naturally as the system accumulated data. It had not. Policyholders were completing interactions but calling back the same day. Re-contact volumes on sessions the platform marked as "resolved" were climbing steadily.
ICX was engaged to diagnose and redesign the experience. What the audit found was a pattern that appears consistently in mature chatbot deployments: capable platform infrastructure with no conversation design layer underneath it.
A System That Was Technically Functional
The carrier's chatbot platform was modern and capable. The underlying language model was current, API connections to the policy database were stable, and the front-end interface matched the company's brand standards. The vendor's implementation team had delivered what the contract specified. The system was live and handling volume.
But the conversation layer had been treated as a configuration task rather than a design discipline. The system prompt controlling the AI's behavior had been written during a two-day vendor setup session and never revisited. It described the AI's purpose in three sentences, listed a short set of prohibited topics, and included a tone directive that could have been copied from any generic customer service policy document. It bore no connection to the carrier's brand voice, claims philosophy, or the actual decision logic its agents used when resolving similar questions by phone.
Intent coverage followed the same pattern. Twenty-two defined intents were meant to cover a product with more than 200 coverage permutations. Policyholder questions about claim status timelines, coverage exclusions, deductible calculations, and third-party liability scenarios were landing in the same intent buckets and receiving the same templated responses. The containment metric was accurate. The chatbot was containing sessions. What it was not doing was resolving the actual question behind each one.
Containment and resolution are not the same metric. A chatbot can end a session without escalation while leaving the policyholder's actual question unanswered. The CSAT score reflects which one happened.
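One way to make that distinction measurable is to join sessions the platform marked as contained against subsequent live-agent contacts on the same policy. The sketch below is illustrative only: the field names, the in-memory records, and the schema are assumptions standing in for the carrier's actual logs, with a 24-hour re-contact window.

```python
from datetime import datetime, timedelta

RECONTACT_WINDOW = timedelta(hours=24)

# Hypothetical log records; real data would come from the platform and telephony exports.
chat_sessions = [
    {"policy_id": "P-1001", "ended_at": datetime(2024, 3, 1, 10, 5), "disposition": "contained"},
    {"policy_id": "P-1002", "ended_at": datetime(2024, 3, 1, 11, 20), "disposition": "escalated"},
    {"policy_id": "P-1003", "ended_at": datetime(2024, 3, 2, 9, 15), "disposition": "contained"},
]
agent_contacts = [
    {"policy_id": "P-1001", "started_at": datetime(2024, 3, 1, 16, 40)},  # same-day callback
    {"policy_id": "P-1003", "started_at": datetime(2024, 3, 5, 14, 0)},   # outside the window
]

def recontacted(session, contacts, window=RECONTACT_WINDOW):
    """True if a live-agent contact on the same policy follows the session within the window."""
    return any(
        c["policy_id"] == session["policy_id"]
        and session["ended_at"] < c["started_at"] <= session["ended_at"] + window
        for c in contacts
    )

contained = [s for s in chat_sessions if s["disposition"] == "contained"]
contained_but_recontacted = [s for s in contained if recontacted(s, agent_contacts)]

print(f"Containment rate: {len(contained) / len(chat_sessions):.0%}")
print(f"Contained-but-recontacted rate: {len(contained_but_recontacted) / len(contained):.0%}")
```

A dashboard that reports only the first number will look healthy while the second number climbs.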
What the Conversation Data Revealed
ICX reviewed three months of conversation logs, focusing on sessions the platform marked as "contained" but followed within 24 hours by a live agent contact on the same policy. Three patterns dominated the data.
Intent resolution was shallow across the board. The 22-intent architecture was generating confident responses to ambiguous inputs. When a policyholder asked about a claim timeline, the system returned a generic response about the claims process without distinguishing between first-party claims, third-party liability claims, and catastrophic event claims, which operate on materially different timelines under the carrier's own policy language. Policyholders were receiving responses that were technically not false but were not useful for the question they had actually asked.
The system had no mechanism for carrying context forward across a multi-turn session. Policyholders would open with one question, receive a response, and follow up with a question that assumed the AI retained what had been established. In most cases, each follow-up was treated as a new session. Responses repeated earlier information or, in some cases, contradicted it. The logs showed policyholders restating context they had already provided, often more than once, before abandoning the session.
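To make the gap concrete, the sketch below shows the kind of session-state handling the deployment lacked: facts established earlier in the conversation carried forward and prepended to each follow-up, rather than each turn arriving as a fresh session. This is a minimal illustration, not the vendor platform's API; the class and field names are placeholders.

```python
class SessionContext:
    """Minimal sketch of carrying established facts across turns.

    Illustrative only: a real deployment would persist this in the platform's
    session store and extract facts from the conversation, not hard-code them.
    """

    def __init__(self):
        self.established_facts = []

    def add_fact(self, fact):
        # e.g. "claim type: third-party liability", captured from an earlier turn
        if fact not in self.established_facts:
            self.established_facts.append(fact)

    def build_prompt(self, follow_up_question):
        """Prepend what is already known so a follow-up is answered in context."""
        facts = "\n".join(f"- {f}" for f in self.established_facts) or "- (none yet)"
        return (
            "Facts already established in this conversation:\n"
            f"{facts}\n\n"
            f"Policyholder's next question: {follow_up_question}"
        )


session = SessionContext()
session.add_fact("claim type: third-party liability")
session.add_fact("claim filed: 2024-02-12")
print(session.build_prompt("So when will I hear back?"))
```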
Escalation was a wall, not a door. When the system reached its resolution limit, it offered a single fallback: a phone number and a message stating the question required speaking to an agent. The message included no acknowledgment of what the AI had understood. No context was passed to the live agent queue. No alternatives such as a callback or a follow-up email were offered. For policyholders already managing a claim situation, the escalation message read as abandonment.
Three months of conversation logs revealed patterns the platform dashboard had not surfaced.
The Redesign: Three Layers, in Sequence
ICX recommended against a platform change. The infrastructure was sound, and a migration would have introduced risk without addressing any of the actual problems. The work needed to happen at the experience layer, and that layer was fully accessible without touching the vendor contract.
The system prompt was rebuilt over six weeks. The revised prompt established the AI's role with specificity: what it could answer confidently, what it should handle with explicit uncertainty language, and how it should behave when a policyholder showed signs of distress. Tone guidance was drawn directly from the carrier's existing brand standards and policyholder communication guidelines rather than generic CX language. Handling instructions for the 15 most frequent ambiguous scenarios were written into the prompt explicitly rather than left to model inference. The result was a system prompt that functioned as a set of design decisions, not a configuration note.
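A schematic of that structure is sketched below. It shows the shape of the rebuilt prompt, not the carrier's actual wording; every category label and line of text here is a placeholder.

```python
# Illustrative structure only: section names and wording are placeholders, not the carrier's prompt.
SYSTEM_PROMPT_SECTIONS = {
    "role": (
        "You are the policyholder assistant for a property and casualty carrier. "
        "You answer servicing questions about existing policies and open claims."
    ),
    "answer_confidently": [
        "claim status for first-party claims",
        "payment due dates and billing history",
        "document upload instructions",
    ],
    "answer_with_explicit_uncertainty": [
        "coverage applicability for a specific loss scenario",
        "third-party liability timelines",
        "deductible outcomes that depend on adjuster review",
    ],
    "distress_handling": (
        "If the policyholder describes an active loss, injury, or urgent hardship, "
        "acknowledge it first, then offer escalation before continuing."
    ),
    "tone": "Plain language, no policy jargon without explanation, per the carrier's voice guide.",
}

def compose_system_prompt(sections):
    """Assemble the sections into a single prompt string."""
    parts = [sections["role"], "", "Answer confidently:"]
    parts += [f"- {t}" for t in sections["answer_confidently"]]
    parts += ["", "Answer only with explicit uncertainty language:"]
    parts += [f"- {t}" for t in sections["answer_with_explicit_uncertainty"]]
    parts += ["", sections["distress_handling"], "", f"Tone: {sections['tone']}"]
    return "\n".join(parts)

print(compose_system_prompt(SYSTEM_PROMPT_SECTIONS))
```

The point of the structure is that each section records a design decision that can be reviewed and revised against conversation data, rather than a one-time configuration note.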
Intent architecture was rebuilt in collaboration with the carrier's claims and underwriting subject matter experts, who identified the coverage distinctions that most frequently drove policyholder confusion. The revised structure grouped questions by resolution pathway rather than by surface-level topic keyword. A question about a claim timeline from a policyholder with a third-party liability claim now routed differently from the same surface-level question from a policyholder with a first-party homeowners claim. The AI was routing based on what the policyholder needed resolved, not on the words they had used.
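In schematic terms, the routing change looks like the sketch below. The intent and pathway names are invented for illustration; the substantive point is that the routing key includes the resolution-relevant attribute (here, claim type), not just the surface topic of the question.

```python
# Illustrative only: intent and pathway names are placeholders, not the carrier's taxonomy.
RESOLUTION_PATHWAYS = {
    ("claim_timeline", "first_party"): "explain_first_party_claim_stages",
    ("claim_timeline", "third_party"): "explain_liability_investigation_timeline",
    ("claim_timeline", "catastrophe"): "explain_cat_event_handling_and_delays",
    ("deductible", "first_party"): "calculate_deductible_from_policy_record",
}

def route(topic, claim_type):
    """Route on what needs to be resolved, not on the surface wording of the question."""
    pathway = RESOLUTION_PATHWAYS.get((topic, claim_type))
    return pathway or "escalate_with_context"  # unknown combinations go to graduated escalation

# Two policyholders asking the same surface-level question land on different pathways.
print(route("claim_timeline", "first_party"))   # explain_first_party_claim_stages
print(route("claim_timeline", "third_party"))   # explain_liability_investigation_timeline
```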
Escalation flows were redesigned to be graduated and context-forward. When the AI reached a resolution limit, the new flow acknowledged what it had understood, confirmed what it was passing to the live agent, and offered a choice: continue waiting on the current channel, request a callback, or receive a follow-up summary by email. The live agent queue received a pre-populated context card summarizing the key points of the conversation. Agents no longer started from zero. Policyholders no longer had to repeat themselves.
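The sketch below shows the shape of that handoff: an acknowledgment back to the policyholder, a set of channel choices, and a context card for the agent queue. Field names and option labels are assumptions about a generic agent desktop, not the carrier's actual integration.

```python
# Illustrative only: field names and channel options are placeholders.
def build_escalation(session_facts, unresolved_question):
    """Build the acknowledgment, the choices offered, and the context card for the agent queue."""
    context_card = {
        "understood_so_far": session_facts,          # what the AI acknowledges back to the policyholder
        "unresolved_question": unresolved_question,  # what the agent needs to pick up
    }
    options = ["stay_in_channel", "request_callback", "email_summary"]
    acknowledgment = (
        "Here's what I've understood so far: "
        + "; ".join(session_facts)
        + f". I can't resolve '{unresolved_question}' on my own, so I'm passing it to an agent."
    )
    return acknowledgment, options, context_card

ack, options, card = build_escalation(
    ["third-party liability claim", "claim filed 2024-02-12"],
    "when the liability investigation will conclude",
)
print(ack)
print("Options offered:", options)
```

Because the agent queue receives the context card alongside the transfer, the policyholder's next conversation starts from what has already been established rather than from zero.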
As ICX has documented elsewhere, the language layer is where most chatbot performance problems live. Redesigning it does not require a new platform. It requires treating conversation design as real work, with dedicated time, the right expertise, and a clear brief derived from actual conversation data rather than assumptions made during initial implementation.
What 90 Days of Data Showed
Three months after the redesigned experience went live, CSAT had moved from 61 percent to 84 percent. Same-day agent re-contacts on sessions the platform marked as "contained" dropped by 41 percent. Average handle time for escalated sessions decreased, because agents were receiving structured context rather than starting each session without background. First-contact resolution on the top five intent categories increased across the board.
The platform had not changed. The underlying model had not changed. The carrier's policy data and coverage structures had not changed.
What changed was the experience layer: the system prompt, the intent architecture, and the escalation design. None of that work had been in scope during the original vendor implementation. None of it had been identified as a gap in the post-launch review. All of it was recoverable once the right expertise and the right audit methodology were applied to the actual conversation data.
The consistent finding across similar ICX engagements is that enterprise chatbot performance plateaus are almost never platform problems. They are experience design problems, and they are most visible not in the containment rate but in what happens after the session ends. A capable model on a sound platform with an underdeveloped system prompt and no escalation design will produce exactly the results this carrier saw. The 30-minute AI CX audit is a starting point for identifying which layer the problem lives in, and the hidden cost of good enough AI documents what organizations pay, in CSAT and in re-contact volume, when that layer goes unaddressed.
For organizations recognizing this pattern in their own deployments, the services page covers how ICX structures conversation design and system prompt audit engagements, and the contact page is the right place to begin a conversation about what the data in your own logs might be telling you.
AI Transparency Disclosure
This article was created with the assistance of AI technology (Anthropic Claude) and reviewed, edited, and approved by Christi Akinwumi, Founder of Intelligent CX Consulting. All insights, opinions, and strategic recommendations reflect ICX's professional expertise and real-world consulting experience. The case study details have been anonymized to protect client confidentiality.
ICX believes in radical transparency about AI usage. As an AI consulting firm, it would be contradictory to hide the tools that make this work possible. Anthropic's Transparency Framework advocates for clear disclosure of AI practices to build public trust and accountability. ICX applies this same standard to its own content. Read more about why AI transparency matters.