A third post in the same thread as the legal and trading ones, because the question keeps coming up in different shapes: "What does the memory layer look like for AI customer service agents?"
The honest answer is that it looks a lot like the legal answer. The cost of a retrieval miss isn't a slightly less helpful response — it's an agent confidently stating a policy that doesn't exist. And the production data on how often that's happening today is sobering.
The category is real and scaling fast
AI customer service has stopped being a demo. Decagon raised at a $4.5B valuation in January 2026 on the back of more than 100 new enterprise customers in 2025 — Notion, Duolingo, Rippling, Bilt, Eventbrite, Substack, Oura, Affirm, Chime — plus F100 airlines, banks, telcos, and retailers. One published case has a customer cutting their support team by 80%. Decagon's own write-up frames it as concierge-grade customer experience delivered across every channel a customer touches.
What I think is underappreciated is the technical pressure underneath that growth: every one of those agents is a retrieval problem first and a generation problem second. The model is good. The model can write the answer. The question is whether it has the right policy in front of it when it does.
The Air Canada problem, in numbers
The most-cited cautionary tale in this category is real and public. An Air Canada chatbot told a grieving customer there was a bereavement-fare refund policy that did not exist. The customer made a purchase relying on the response. The British Columbia Civil Resolution Tribunal ruled in 2024 that Air Canada was liable for what the chatbot said. The airline argued the chatbot was a separate legal entity. The tribunal disagreed.
That case made the pattern legible — the "AI invents a policy that sounds plausible" failure mode is now its own thing, and people working in customer-service AI literally call it "the Air Canada pattern": the agent invents return windows, refund conditions, warranty terms, or service-level commitments that the brand cannot actually honor. The model generates them because thousands of similar policies appeared in training data; the brand finds out about the commitment when a customer holds them to it.
McKinsey's 2025 global survey on AI found nearly one in three respondents reporting negative consequences specifically from AI inaccuracy. By the end of 2025 that figure had climbed to 51% of organizations reporting at least one such incident. Some of those are model failures. A surprising number of them, on inspection, turn out to be retrieval failures the model patched over.
Why retrieval recall sits at the center of this
The consensus in the customer-service AI community is well summarized by Cleanlab's prevention guide: hallucinations happen most often when the AI lacks the right context — when the right policy document, the right past ticket, the right product page, exists somewhere in the knowledge base but doesn't make it into the prompt because the retrieval layer missed it. The model then fills the gap.
Two things are happening at once. First, the retrieval index is usually approximate, so it returns the top K most similar items rather than all relevant items — and the cut-off is somewhere in the 85–95% recall range for typical production HNSW or IVF setups. Second, the corpus a service agent works over is exactly the kind that punishes that gap: policy text that paraphrases across documents, ticket histories with slight wording variations, product specs that overlap. A retrieval that misses one relevant policy is a retrieval that hands the model the freedom to invent.
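To make that gap concrete, here is a minimal sketch (toy data, numpy only, every name illustrative) of the exhaustive scan that defines the 100%-recall baseline an approximate index trades away from:

```python
import numpy as np

# Toy corpus: 10k unit-normalized embeddings standing in for policy chunks.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 384)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

def exact_top_k(query: np.ndarray, k: int = 10) -> np.ndarray:
    """Exhaustive cosine search: every vector is scored, so recall@k is
    100% by construction. An ANN index (HNSW, IVF) replaces this scan with
    a graph or cluster traversal that can skip true neighbors entirely;
    that skipped fraction is the 5-15% recall loss discussed above."""
    scores = corpus @ query              # cosine similarity on unit vectors
    return np.argsort(-scores)[:k]       # indices of the k best matches

query = rng.standard_normal(384).astype(np.float32)
query /= np.linalg.norm(query)
print(exact_top_k(query))
```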
For a Decagon-class workload running tens of millions of interactions a year, 5–15% recall loss on the retrieval layer isn't a minor accuracy issue. It's the surface area where Air Canada incidents live.
Where UQL fits
The memory service we built doesn't extract policies from your knowledge base at write time. It stores what you give it — policy text, past tickets, product specs, anything — and lets the application compose recall the way it actually needs to, with a 100% recall floor on the vector layer.
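As a loose sketch of what "store what you give it, compose at read time" means at the application boundary (the MemoryClient class and its put/query methods are hypothetical illustrations, not the actual UQL API):

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    text: str                          # stored verbatim: policy text, ticket, spec
    metadata: dict = field(default_factory=dict)

class MemoryClient:
    """Hypothetical client, not the real UQL API: put() does no write-time
    extraction, and query() shows the metadata half of composed recall
    (the vector half is sketched in the examples further down)."""
    def __init__(self) -> None:
        self._records: list[Record] = []

    def put(self, text: str, **metadata) -> None:
        self._records.append(Record(text, dict(metadata)))

    def query(self, **where) -> list[Record]:
        return [r for r in self._records
                if all(r.metadata.get(k) == v for k, v in where.items())]

client = MemoryClient()
client.put("Final-sale items: no returns except where state law requires.",
           doc_type="policy", region="CA")
print(client.query(region="CA"))
```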
Here are the end-to-end benchmarks from our internal UQL run on dbpedia-openai cosine — full dimensions, no truncation.
Hardware: AMD Ryzen 9 · 64 GB RAM · NVMe Gen 4 2 TB · no GPU.
| Phase | What it tests | Recall@10 | p50 | p99 |
|---|---|---|---|---|
| L1 — INTUITION | Vector search | 100% | 4.70 ms | 4.93 ms |
| L2 — EPISODIC | Filtered vector search | 100% | 5.51 ms | 6.02 ms |
| L3 — RBAC STATE | SQL → Vector join | 100% | 7.73 ms | 7.96 ms |
| L4 — ANALYTICS | GROUP BY / SUM / AVG on results | 100% | 5.46 ms | 5.87 ms |
| L5 — CROSS-CHECK | Correctness across modes | 100% | — | — |
| L6 — PIPELINE | Multi-step pipeline plans | 100% | 0.24 ms | 8.69 ms |
The underlying recall-floor story is in the earlier posts on 100% recall on disk and the 3072-dim follow-up; UQL is the contract that lets a service-agent application compose against it.
Three service-shaped query patterns
In the same shape as the previous posts — these are the kinds of queries a customer-service agent actually needs, written the way the application would phrase them.
The policy lookup that has to find the right policy
"For a customer in California, returning a final-sale item bought during the Memorial Day promotion, what is the actual return window?"
That's a semantic search over the policy corpus ("final-sale return window") constrained by jurisdiction (region = "CA") and promotion (promo = "memorial_day_2025"). On UQL it's an L3 SQL → Vector join — the structured filters narrow the eligible policy documents, the vector search lands the right one, and the agent reads exactly what the brand's policy actually says. With 92% recall, eight times out of a hundred the right policy exists but doesn't come back, and the model is free to invent the answer. With 100% recall, the "invent the answer" path stops being available — the agent either has the policy or knows it doesn't.
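A sketch of that join shape in plain Python, under stated assumptions: the corpus, field names, and policy_lookup helper are invented for illustration, and the real query would run server-side in UQL rather than in application code.

```python
import numpy as np

# Invented corpus: three policy docs with structured metadata plus embeddings
# (random stand-ins here; real ones come from the embedding model).
rng = np.random.default_rng(1)
docs = [
    {"region": "CA", "promo": "memorial_day_2025", "text": "Final-sale promo return window ..."},
    {"region": "CA", "promo": None,                "text": "Standard CA return policy ..."},
    {"region": "NY", "promo": "memorial_day_2025", "text": "NY promo return policy ..."},
]
embs = rng.standard_normal((len(docs), 384)).astype(np.float32)
embs /= np.linalg.norm(embs, axis=1, keepdims=True)

def policy_lookup(query_emb, region, promo, k=2):
    """SQL-then-vector join: structured filters narrow the eligible docs
    first, then the vector search over that subset is exhaustive, so an
    eligible policy can never be silently dropped by an index."""
    idx = [i for i, d in enumerate(docs)
           if d["region"] == region and d["promo"] in (promo, None)]
    if not idx:
        return []                  # nothing eligible: escalate, don't invent
    scores = embs[idx] @ query_emb
    order = sorted(range(len(idx)), key=lambda j: -scores[j])[:k]
    return [docs[idx[j]] for j in order]

q = rng.standard_normal(384).astype(np.float32)
q /= np.linalg.norm(q)
print(policy_lookup(q, region="CA", promo="memorial_day_2025"))
```

The point of the shape is the order of operations: filter first, then exact search over whatever survives the filter.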
The cross-channel customer history
"Pull everything this customer has touched in the last 90 days — chats, emails, voice transcripts — that mentions billing or fraud."
Decagon's Voice 2.0 release specifically called out cross-channel memory as a wedge, and the underlying need is real: a customer who called yesterday, emailed at 3am, and is now chatting at 9am expects the agent to know the full thread, not the last three minutes. On UQL it's an L2 filtered vector search — customer_id = X AND date_range AND channel IN (chat, email, voice) fused with the semantic match. One call, exact recall, every relevant interaction in time order.
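A sketch of the same filtered-recall shape, again with invented data and a matches stand-in for the semantic test:

```python
from datetime import datetime, timedelta

# Invented interaction log; in production each row also carries an embedding
# and the "mentions billing or fraud" test is a fused semantic match.
interactions = [
    {"customer_id": "c42", "channel": "voice", "ts": datetime(2026, 2, 10, 14, 0), "text": "disputed charge on card"},
    {"customer_id": "c42", "channel": "email", "ts": datetime(2026, 2, 11, 3, 0),  "text": "fraud alert follow-up"},
    {"customer_id": "c42", "channel": "chat",  "ts": datetime(2026, 2, 11, 9, 0),  "text": "account still locked"},
]

def customer_history(customer_id, now, days=90,
                     channels=("chat", "email", "voice"),
                     matches=lambda text: True):
    """Filtered recall: customer, window, and channel predicates ANDed with
    a semantic test (`matches` is a stand-in for vector similarity), with
    results returned in time order so the agent sees the whole thread."""
    cutoff = now - timedelta(days=days)
    rows = [r for r in interactions
            if r["customer_id"] == customer_id
            and cutoff <= r["ts"] <= now
            and r["channel"] in channels
            and matches(r["text"])]
    return sorted(rows, key=lambda r: r["ts"])

for r in customer_history("c42", now=datetime(2026, 2, 12)):
    print(r["ts"], r["channel"], r["text"])
```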
The aggregate that drives an SLA
"For tier-1 customers who escalated this quarter, what were the most common root causes, and what was the median time-to-resolution?"
That's L4 analytics over a vector-retrieved subset — group the matched tickets by classified root cause, aggregate the resolution times. Same call. The operations team gets an answer instead of a JIRA ticket asking the data team for a notebook.
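A minimal sketch of that aggregation step, with invented ticket rows standing in for the vector-retrieved subset:

```python
from collections import defaultdict
from statistics import median

# Invented escalated-ticket rows, standing in for a vector-retrieved subset.
tickets = [
    {"root_cause": "billing_error",  "hours_to_resolve": 4.0},
    {"root_cause": "billing_error",  "hours_to_resolve": 7.5},
    {"root_cause": "shipping_delay", "hours_to_resolve": 30.0},
    {"root_cause": "billing_error",  "hours_to_resolve": 5.0},
]

# GROUP BY root cause, then count and median time-to-resolution. In UQL this
# aggregation would run server-side in the same call as the retrieval.
groups: dict[str, list[float]] = defaultdict(list)
for t in tickets:
    groups[t["root_cause"]].append(t["hours_to_resolve"])

for cause, hours in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    print(f"{cause}: {len(hours)} tickets, median {median(hours):.1f}h")
```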
What changes for the agent-builder
If you're building a service or sales agent, the practical effect of moving the recall floor from approximate to 100% is that one specific failure mode goes away. The agent stops being able to quietly miss the right policy and quietly fill the gap. Either it finds the policy and answers correctly, or it doesn't find one and escalates — which is the behavior the prevention playbook recommends in the first place, and which only becomes reliable if your retrieval layer is reliable.
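A sketch of that retrieve-or-escalate gate, with an illustrative score threshold:

```python
def answer_or_escalate(hits, min_score=0.75):
    """Answer only from a retrieved policy; otherwise hand off to a human.
    The threshold is an illustrative knob. With a 100% recall floor, an
    empty or low-scoring result means "no such policy exists", not
    "the index happened to miss it", so this gate becomes trustworthy."""
    confident = [h for h in hits if h["score"] >= min_score]
    if not confident:
        return {"action": "escalate", "reason": "no matching policy found"}
    best = max(confident, key=lambda h: h["score"])
    return {"action": "answer", "policy": best["text"]}

print(answer_or_escalate([{"score": 0.91, "text": "Final-sale items ..."}]))
print(answer_or_escalate([{"score": 0.40, "text": "Unrelated doc ..."}]))
```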
For brands at Decagon scale — tens of millions of interactions, dozens of policy domains, F100 audit posture — the question isn't "does AI customer service work?" (it does). It's "what's the path to making the failure mode auditable instead of invented?" That path runs through the memory layer.
Honest caveats
A few, since they matter:
- We're not Decagon, and we're not building a customer-service application. We're an infrastructure layer that a Decagon-style application (or one of its competitors, or an in-house support team) could choose to run on. The blog above is about what becomes possible at the layer below the application.
- The benchmarks above are on dbpedia-openai cosine — a standard public dataset, not a policy corpus. The shape of the recall and latency guarantees holds, but for a specific brand's policy archive the numbers would need to be re-run on that corpus.
- 100% retrieval recall is necessary but not sufficient. Even with the right policy in the prompt, models can still misread or misapply it. Retrieval recall removes the most common failure mode; it doesn't replace the rest of the prevention stack (validators, low-confidence routing, escalation paths).
- Maturity. We're in private beta. Smaller customer base, fewer integrations than the established memory services. That's real, and worth weighing.
If you're building in this space
If you're building an AI customer-service agent or sales agent, and either the Air Canada failure mode or the cross-channel memory problem is part of what you're thinking about, please reach out. I'd genuinely like to compare notes on what you're using today and where it's working.
— Danny