A short follow-up to the 100% recall on disk post and the 3072-dim follow-up, this time on the search-engine-for-AI category that's emerged as its own thing over the last 18 months.
I had originally planned to write this as a memory post in the same thread as the legal, trading, and customer-service ones. After looking at the actual workload, though, I think it belongs in the vector thread instead — what these teams are building is closer to a search index than an agent memory layer.
The category is real
Exa raised $85M at a $700M valuation in September 2025, with Benchmark, Lightspeed, Nvidia, and YC on the cap table and advisors from OpenAI, Google, and Bing. The product is a semantic search API designed specifically for AI agents — not Google for humans, but a retrieval layer for applications like Cursor's @web feature and Notion AI's news search. Their own engineering write-ups put the scale at tens of billions of web pages with minute-level refresh rates, embedded into dense vectors and served from a GPU-accelerated cluster.
What I think is interesting about the timing — and the reason this fits in the same thread as the recall posts — is that the market has accepted that "search built for AI agents" is its own product category, distinct from both keyword search and from memory layers. The agents need a retrieval API that returns the right documents semantically, not the most-clicked ones, and the underlying technology is the same vector-retrieval primitive every memory service uses.
The recall problem looks slightly different here
In memory workloads, the user picks what to store — "these are my contracts, my tickets, my conversations" — and the retrieval has to find what's there. In an AI search engine, the corpus is the open web (or a vertical slice of it), and the retrieval has to find the right page out of hundreds of millions of candidates.
The current industry baseline is sobering once you look at it. As of 2026, the RAG benchmarking community treats recall above 0.65 as "industry standard" and recall above 0.80 as a stretch goal for enterprise systems. Most production RAG stacks — including the open-source defaults — sit between 70% and 85% recall at the retrieval step. That range doesn't sound bad in isolation, but the cost compounds:
- When the right document isn't returned, the model has to answer from what did come back. Sometimes that's nothing useful, and the model invents. Sometimes it's an adjacent, slightly-wrong document, and the model confidently summarizes the wrong source.
- For an AI agent running thousands of search queries per task, "70% per query" compounds badly. Three calls in a chain at 70% recall each leave only a 34% chance that all three returned the right document (0.7³ ≈ 0.34; see the quick sketch after this list).
- The model can't tell the difference between "the corpus doesn't contain the answer" and "the retrieval missed the answer." The downstream behavior is identical, and indistinguishable from the application's perspective.
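To make the compounding concrete, here is a minimal sketch of the arithmetic behind that 34% figure. The independence assumption is a simplification, not a claim about real agent traces, but the direction of the effect holds either way.

```python
# Compound recall for a chain of retrieval calls, assuming each call has the same
# recall and the calls are independent (a simplification for illustration).
def chain_recall(per_call_recall: float, num_calls: int) -> float:
    """Probability that every call in the chain returned the right document."""
    return per_call_recall ** num_calls

for n in (1, 3, 10):
    print(f"{n:>2} calls at 70% recall each -> {chain_recall(0.70, n):.0%} chance all hit")
# prints roughly: 70%, 34%, 3%
```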
This is the same shape as the Stanford 17-33% hallucination finding in legal AI and the Air Canada pattern in customer service, just at a different layer of the stack. When the retrieval floor is somewhere around 70-85%, the model is being asked to reason from a sometimes-incomplete picture, and "sometimes" is where the most confident wrong answers live.
What 100% recall changes for AI search
The benchmarks we've published — 100% Recall@10 on dbpedia-openai at both 1536-dim and 3072-dim, in single-digit milliseconds — describe the engine that sits underneath. For a search-engine-for-AI application, that's the layer where the "return the right document" contract lives.
Three things become possible that aren't quite possible at 85% recall:
Verticalized search where every document matters. Legal corpora, regulatory archives, internal product documentation, scientific literature — anywhere the application can't afford to silently miss a relevant document, the retrieval floor is the whole game. Hitting 100% at the index level removes an entire category of failure from the system: "the document existed but didn't come back."
Composable retrieval. A real AI search query is rarely a single vector lookup. It's "find documents about X, filtered to publications after Y, from sources Z, that cite specific entities." On a UQL-style engine those compose in a single round-trip — the search post on SQL + Vector composition walks through the L3 join path. On most search APIs, the application makes multiple calls and intersects in code.
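For contrast, here is a hedged sketch of what that multi-call pattern tends to look like from the application side. Every name in it (vector_search, published_after, from_sources, find) is an illustrative stand-in, not a real API.

```python
# Hypothetical "multiple calls and intersect in code" pattern. The three helper
# functions stand in for separate API round-trips; their bodies are stubbed out.

def vector_search(query_vec, k):
    """Stand-in for a dense retrieval call returning candidate doc ids by similarity."""
    return []  # network call in a real client

def published_after(doc_ids, cutoff):
    """Stand-in for a metadata lookup returning the subset published after `cutoff`."""
    return []  # second round-trip

def from_sources(doc_ids, sources):
    """Stand-in for another lookup returning the subset from the allowed sources."""
    return []  # third round-trip

def find(query_vec, cutoff, sources, k=10):
    # Over-fetch, because the filters below will discard candidates. The factor is a
    # guess: too small and relevant documents are dropped before the filters see them.
    candidates = vector_search(query_vec, k=k * 20)

    recent = set(published_after(candidates, cutoff))
    trusted = set(from_sources(candidates, sources))

    # The intersection happens in application code; each predicate cost a round-trip.
    return [d for d in candidates if d in recent & trusted][:k]
```

The point of composition is that the date and source predicates travel with the vector query in one round-trip, so the application never has to guess an over-fetch factor or reconcile partial result sets itself.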
Honest empty results. When recall is 100% and the search returns nothing, the application knows the corpus doesn't contain the answer. The agent can route to a different tool, ask the user for clarification, or admit it doesn't know. Today the agent can't distinguish "corpus miss" from "retrieval miss," and the behavior the user sees is "agent invented something."
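A minimal sketch of the routing that becomes safe once an empty result can be trusted. The callables here (search, synthesize, web_fallback, ask_user) are illustrative parameters, not a real SDK.

```python
# Hypothetical agent-side routing, assuming the retrieval layer guarantees that an
# empty result means "the corpus doesn't contain it", not "the index missed it".

def answer(question, query_vec, *, search, synthesize, web_fallback=None, ask_user=None):
    hits = search(query_vec, k=10)

    if hits:
        return synthesize(question, hits)  # normal RAG path: reason over real documents

    # Under a recall guarantee, an empty result is a fact about the corpus, so the
    # agent can route honestly instead of letting the model improvise an answer.
    if web_fallback is not None:
        return web_fallback(question)
    if ask_user is not None:
        return ask_user(f"I couldn't find anything in the corpus for: {question!r}")
    return None  # an explicit "don't know" beats a confident guess
```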
Where this fits (and where it doesn't)
I want to be careful here, because Exa operates at web scale and we don't. They serve tens of billions of pages from a GPU-accelerated cluster, and getting 100% recall at that scale is a different engineering problem from what we've benchmarked at 1M scale on commodity disks.
The honest fit for us is the layer underneath a vertical search application — a legal-research tool indexing a few million opinions, a pharmaceutical research tool indexing the PubMed corpus, an enterprise's internal knowledge base, a regulated archive. The shapes where:
- The corpus is bounded (1M to 100M documents, not tens of billions).
- Every document matters individually.
- The application needs the retrieval to be reproducible and auditable.
- The cost of a missed document is higher than the cost of a slightly slower query.
For the open-web search-for-AI use case at Exa's scale, the trade-offs are different and Exa is doing impressive engineering on a problem we're not trying to solve.
The numbers, as a recap
For completeness — the engine's recall and latency from our public benchmark runs:
| Workload | Recall@10 | p50 | p99 |
|---|---|---|---|
| dbpedia-openai-100k (1536-dim, mmap) | 100% | 216 µs | 220 µs |
| dbpedia-openai-1m (1536-dim, mmap) | 100% | 1.88 ms | 1.93 ms |
| dbpedia-openai-3-large-100k (3072-dim, mmap) | 100% | 407 µs | 411 µs |
| dbpedia-openai-3-large-100k (3072-dim, pure-disk) | 100% | 1.04 ms | 1.07 ms |
All on dbpedia-openai cosine, no truncation, no approximation, no GPU. Reproducible against the public dataset — if the numbers don't come out the same on your hardware, please tell me; that's on us to explain.
Hardware: AMD Ryzen 9 · 64 GB RAM · NVMe Gen 4 2 TB.
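For anyone re-running the table, this is roughly what the Recall@10 computation looks like against the dataset's public ground truth. How you load the queries and call the engine depends on your setup; run_query below is a placeholder for whatever client you use.

```python
# Recall@10 against a public ground-truth set: the fraction of the true top-k
# neighbours that the engine actually returned, averaged over all queries.

def recall_at_k(returned_ids, true_ids, k=10):
    returned = set(returned_ids[:k])
    truth = set(true_ids[:k])
    return len(returned & truth) / len(truth)

def mean_recall(queries, ground_truth, run_query, k=10):
    # `run_query` is a placeholder: it should return the engine's ranked ids for a query.
    scores = [recall_at_k(run_query(q, k), gt, k) for q, gt in zip(queries, ground_truth)]
    return sum(scores) / len(scores)  # 1.0 means every query returned all true neighbours
```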
Honest caveats
A few, because the topic deserves them:
- We are not Exa, and we are not trying to be. Their engineering write-ups cited above describe a different scale and a different workload. Exa's value at web scale is genuine; we'd recommend them for that fit.
- The benchmarks above are on dbpedia-openai cosine — a standard public benchmark with public query sets, not a vertical search corpus. The shape of the recall and latency guarantees holds, but for a specific application the numbers would need to be re-run on that corpus.
- 100% retrieval recall is necessary but not sufficient. Even with the right document in the prompt, the application still has to reason over it correctly. Recall removes one specific failure mode; the rest of the application-quality stack still matters.
- Maturity. We're in private beta. Smaller customer base, fewer integrations than the established vector databases. That's real, and worth weighing.
If you're building a verticalized AI search application
If your corpus is bounded but every document matters, and you're hitting the recall floor in your current vector stack, please reach out. I'd genuinely like to compare notes on what you're using today and where it's working.
— Danny