A question that came up after the Pinecone / Qdrant / Milvus comparison: "OK, but how are you actually getting 100% recall on disk if you're not doing brute force and not using mmap?"
It's a fair question. The two standard paths to 100% recall in 2026 are:
- GPU brute-force — compute distance from the query to every vector in the corpus, in parallel on GPU memory. Exact by construction. Exa Labs is the clearest production reference here: their search index runs on the "Exacluster" of 144 NVIDIA H200 GPUs across 18 servers, and that's what it takes to keep their answer set complete at web scale.
- mmap-backed exact index — keep an exact index on disk, let the OS page cache hold the working set in RAM, serve queries from there. Production vector databases including Milvus ship and document this mode as the way to fit more vectors per box than would fit in RAM directly.
Both work. Both ship in production today. We didn't take either one, and the reason isn't "we have a better algorithm" — it's that each one has a structural property that ruled it out for the workloads we kept hearing about. This post walks through what those properties are and why we ended up on a different path.
What brute-force buys, and what it costs
Brute-force vector search is the simplest correct algorithm: for each query, compute the distance to every vector in the corpus, sort, return the top-K. Recall is 100% by definition because nothing got skipped.
The cost is in the inner loop. At 1M vectors × 1,536 dimensions × 4 bytes per float, that's roughly 6 GB of vector data streamed through the ALUs per query. On a CPU, even with SIMD, that's hundreds of milliseconds to seconds. On a GPU with high-bandwidth memory, the same math takes milliseconds, at the price of putting GPUs in the latency path of every query. This is the path NVIDIA themselves describe in detail for high-throughput exact and near-exact vector retrieval.
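For concreteness, here's what that inner loop looks like; a minimal NumPy sketch, with inner-product scoring over normalized vectors as my illustrative assumption rather than anyone's production kernel:

```python
import numpy as np

def brute_force_topk(corpus: np.ndarray, query: np.ndarray, k: int = 10) -> np.ndarray:
    """Exact top-K by inner product over L2-normalized rows.

    Every row is scored and nothing is skipped, so Recall@K is 100%
    by construction.
    """
    scores = corpus @ query                    # N dot products: the full scan
    top = np.argpartition(-scores, k)[:k]      # O(N) partial select of the K best
    return top[np.argsort(-scores[top])]       # sort only the K winners

# 1M x 1536 float32 is ~6.1 GB streamed per query. At ~100 GB/s of DDR
# bandwidth that is ~60 ms; at ~3 TB/s of HBM it is ~2 ms, which is the
# whole case for keeping the corpus in GPU memory.
```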
For specific workloads, that's a fine trade. If your application can budget GPU time per query and pay a GPU-class infrastructure bill, brute-force is the most operationally simple shape. There's no index to maintain, no recall to tune, no graph to keep healthy under streaming writes. Exa Labs' GPU cluster sized at 144 H200s for web-scale retrieval is the production reference point — when the corpus is the open web, that's what the bill looks like.
The trade we couldn't make:
- GPU in the query path is a structural cost. A 6 GB-per-query scan at any meaningful QPS pins a GPU for every active query; a back-of-envelope sketch follows this list. Cloud GPUs cost roughly 50–200× what a CPU costs at equivalent throughput. For an application running thousands of queries per second, the math stops working before the workload becomes interesting.
- It doesn't generalize to disk. The whole reason GPU brute-force is fast is that the vectors live in VRAM. The moment your corpus exceeds VRAM, you're paging from disk into VRAM for every query, and the latency advantage disappears.
- Streaming writes are awkward. Every new vector has to make it into VRAM somehow; you end up with a hybrid where the index is split across hot and cold tiers.
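Here's the back-of-envelope referenced in the first bullet. The bandwidth figure is an assumed H100-class number, not a measurement:

```python
# Back-of-envelope; the bandwidth number is assumed, not measured.
BYTES_PER_QUERY = 1_000_000 * 1536 * 4            # ~6.1 GB scanned per query
HBM_BANDWIDTH = 3.0e12                            # ~3 TB/s, H100-class HBM
GPU_SECONDS_PER_QUERY = BYTES_PER_QUERY / HBM_BANDWIDTH   # ~2 ms of GPU time

for qps in (100, 1_000, 10_000):
    # GPUs kept fully busy just streaming vectors, before any other work
    print(f"{qps:>6} QPS -> {qps * GPU_SECONDS_PER_QUERY:.1f} GPUs saturated")
# 100 -> 0.2, 1,000 -> 2.0, 10,000 -> 20.5: every one billed at GPU rates.
```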
For a workload that needs sub-millisecond p50, 100% recall, and commodity infrastructure, GPU brute-force isn't the answer.
What mmap buys, and what it costs
The other standard path is to build an exact index on disk and use mmap() to memory-map the file. The OS page cache handles "what's hot" — pages the query touches stay in RAM; pages it doesn't get evicted. From the application's perspective, the index is "on disk," but in practice the working set lives in the page cache.
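A minimal sketch of that mechanism, assuming a flat little-endian float32 file already on disk; the filename and shape are illustrative:

```python
import numpy as np

# Map a flat float32 index file; nothing is read until a page is touched.
# The filename and shape are illustrative, not a real on-disk layout.
vecs = np.memmap("index.f32", dtype=np.float32, mode="r",
                 shape=(1_000_000, 1536))

q = np.random.default_rng(0).standard_normal(1536).astype(np.float32)

# First full scan: page faults pull ~6 GB through the page cache.
# Repeat it before anything evicts those pages and it runs at RAM speed.
# The application code is identical either way; only the kernel knows.
scores = vecs @ q
```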
When the working set fits in RAM, mmap-backed indexes are excellent. Reads are essentially memory speed, no application-level cache to maintain, the OS does the right thing. The argument against relying on mmap as a database substrate isn't ours — it's been made carefully in the database research community for years. The peer-reviewed CIDR 2022 paper "Are You Sure You Want to Use MMAP in Your Database Management System?" walks through four specific problems (transactional safety, I/O stalls, error handling, performance) and concludes that for systems that care about predictable tail latency, the OS page cache is the wrong abstraction. The USENIX ATC '20 paper on optimizing mmap for fast storage documents that under page eviction or multi-SSD workloads, mmap is 2–20× slower than direct file I/O.
Even the production vector databases that ship mmap mode are upfront about the trade-offs. Zilliz/Milvus's own write-up on mmap says it cleanly: "Performance gradually decreases as data volume grows, and the feature is recommended for users less sensitive to performance fluctuations."
The properties that ruled it out for us:
- The "on disk" story is misleading once you measure it. Production mmap workloads typically need ~1 GB of RAM per 1M vectors for the hot path to actually stay hot. Calling that "on disk" oversells what's happening — your index is on disk, but your RAM bill scales with index size if you want the published latency.
- Performance falls off a cliff when the working set exceeds RAM. A query that touches a cold page generates a page fault, an NVMe read, and a wait. p50 stays great, but p99 walks; a way to watch that happen is sketched after this list. Tail latency under memory pressure is exactly when production cares the most, and the academic literature above is consistent on this point.
- Workload mix matters more than it should. If the box is doing anything else — another tenant, a background compaction, a sibling service — the page cache gets evicted and your "in-memory" performance becomes "on-disk-with-faults" performance.
- The behavior at 100K and at 10M is different. A claim that's true at 100K ("sub-millisecond, 100% recall, on disk") often isn't true at 10M without proportionally more RAM. We wanted the shape of the behavior to hold across scales.
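Here's the measurement sketched above: one hypothetical way to watch the tail walk, timing partial scans over a NumPy memmap like the one in the earlier sketch:

```python
import time
import numpy as np

def scan_latencies_ms(vecs, n_queries=500, rows_per_query=10_000, seed=0):
    """p50/p99 latency of partial scans over an mmap-backed matrix, in ms.

    Each query touches a random slice of rows, the way an index probe
    touches a subset of pages rather than the whole file.
    """
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(vecs.shape[1]).astype(np.float32)
    samples = []
    for _ in range(n_queries):
        start = int(rng.integers(0, vecs.shape[0] - rows_per_query))
        t0 = time.perf_counter()
        _ = vecs[start:start + rows_per_query] @ q
        samples.append((time.perf_counter() - t0) * 1e3)
    return np.percentile(samples, 50), np.percentile(samples, 99)

# Warm cache: p50 and p99 sit together. Under memory pressure, queries
# landing on resident pages keep p50 low, while the ones that fault to
# NVMe stretch p99: that is the walk in the tail.
```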
For workloads where the corpus is small and the box has plenty of RAM headroom, mmap is fine. For everything else, we wanted out from under the page cache dependency.
The third path
We ended up on a design that's neither brute-force nor mmap. The behavior is what the public benchmarks show — 100% Recall@10 at sub-millisecond p50 on commodity disks, at both 1536-dim and 3072-dim, holding the same shape across 100K and 1M corpora. No GPU. No reliance on RAM headroom.
I'm not going to describe the design here. That design is the investment we still need to recoup. What I will say at the outcome level, which is what most readers actually need, is that the system delivers correctness as a guarantee rather than as a parameter, and the cost model doesn't degrade as the corpus grows. Whether that's the right fit for a given workload is something we're happy to walk through privately.
What this means in practice
For most readers the practical question isn't "what algorithm" but "what does this change about my workload." Three things, concretely:
- The recall floor is fixed. You don't tune toward 100%; it's there by default, and the scoring is sketched after this list. The conversation with the application team moves from "what recall can we live with" to "what query do you want to compose."
- Hardware sizing is simpler. Disk capacity sets the upper bound on corpus size. RAM affects throughput, not correctness — a smaller box is slower, not less accurate.
- Streaming inserts stay correct. A vector you wrote 30 seconds ago is queryable at 100% recall, alongside vectors that were there at index build time. No "reindex window."
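For the first bullet, here's Recall@K as it's conventionally scored against exact ground truth; the function and names are mine for illustration:

```python
import numpy as np

def recall_at_k(returned, exact, k=10):
    """Mean fraction of the exact top-K present in the returned top-K."""
    hits = sum(len(set(r[:k]) & set(e[:k])) for r, e in zip(returned, exact))
    return hits / (k * len(exact))

# With a fixed recall floor this is 1.0 by construction, for a vector
# inserted thirty seconds ago as much as for the build-time corpus,
# rather than a dial traded against latency at query time.
```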
If your workload was already in a place where 99% recall and a GPU bill were acceptable, none of this changes the math for you. If you were paying for "99% recall and an in-RAM index that doesn't quite scale," the trade looks different.
Honest caveats
- The benchmarks above are single-node, on dbpedia-openai. For your corpus, the numbers would need to be re-run on your data.
- There's no free lunch. Pure-disk mode is slower than mmap when mmap is doing its best job. The trade is throughput vs predictability and footprint, not throughput vs accuracy.
- Maturity. Private beta, smaller customer base. The design is stable at the published benchmarks; the operations around it are still being hardened.
If your workload sits in the place where neither brute-force nor mmap fits cleanly, please reach out. I'd genuinely like to compare notes.
— Danny