The reason I started working on our vector engine is that more than a few engineering teams kept asking us, in different words, the same question: why does their "vector database" only find 92% of their results?
The honest answer is that most modern vector stores ship with a real trade-off. Want speed? Give up some recall. Want full recall? Use a lot of RAM. Want both? Reach for a GPU. The HNSWs and IVFs of the world have very good papers explaining that this is the natural shape of the problem, and for most workloads the trade is fine.
For a few workloads, though — agent memory, regulated search, anything where a missed result is a real cost — the trade isn't fine. So we tried to see how far we could push the other corner of the design space. Here's where we ended up.
The benchmark
dbpedia-openai is a standard vector-search benchmark — 1.5M Wikipedia article embeddings produced by OpenAI's text-embedding-ada-002, 1536-dim, cosine distance. We run two slice sizes: 100K vectors and 1M vectors. Both slices are drawn from the real benchmark data, not a synthetic best case.
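For clarity on what the recall column in the tables below means: Recall@10 is the fraction of the true 10 nearest neighbors that the engine actually returns, averaged over queries. A minimal scoring sketch (the array names are ours, not part of the benchmark):

```python
import numpy as np

def recall_at_k(true_ids: np.ndarray, returned_ids: np.ndarray, k: int = 10) -> float:
    """Fraction of the exact top-k neighbors that the engine returned.

    true_ids:     (n_queries, k) exact nearest-neighbor ids from brute force
    returned_ids: (n_queries, k) ids returned by the engine under test
    """
    hits = 0
    for truth, got in zip(true_ids, returned_ids):
        hits += len(set(truth[:k]) & set(got[:k]))
    return hits / (len(true_ids) * k)
```

100% recall means this comes out at exactly 1.0 for every query set, not "close enough on average."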
Two run modes on each:
- Pure disk — every read served from NVMe, nothing cached in memory beyond a tiny working set. This is the mode that lets you run a billion-vector index on a 32 GB box.
- mmap — the same on-disk index, but with the OS page cache doing what page caches do. This is the mode you want if you have the RAM headroom.
Both modes hit 100% recall. No approximation, no fallback. Same algorithm, different IO path.
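To make the two paths concrete, here's a toy sketch of reading one vector out of a flat float32 file both ways. This is illustrative only — it is not our engine's layout or code, and a genuinely cache-free pure-disk mode would also bypass the page cache (e.g. with O_DIRECT), which plain os.pread does not:

```python
import mmap
import os
import numpy as np

DIM = 1536
VEC_BYTES = DIM * 4  # float32

def read_vector_pread(fd: int, idx: int) -> np.ndarray:
    """Disk-style path: every lookup is an explicit read against the file descriptor."""
    buf = os.pread(fd, VEC_BYTES, idx * VEC_BYTES)
    return np.frombuffer(buf, dtype=np.float32)

def read_vector_mmap(mm: mmap.mmap, idx: int) -> np.ndarray:
    """mmap path: the OS page cache decides what stays resident in RAM."""
    off = idx * VEC_BYTES
    return np.frombuffer(mm[off:off + VEC_BYTES], dtype=np.float32)

# Usage, assuming vectors.f32 is a flat little-endian float32 file (placeholder name):
# fd = os.open("vectors.f32", os.O_RDONLY)
# mm = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
```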
Hardware: AMD Ryzen 9 · 64 GB RAM · NVMe Gen 4 2 TB · no GPU. Everything below runs on this single workstation.
dbpedia-openai-100k — what one machine looks like
mmap mode
| Metric | Result |
|---|---|
| Dataset | dbpedia-openai-100k |
| Recall@10 | 100% |
| QPS | 4,453 |
| p50 latency | 216 µs |
| p99 latency | 220 µs |
216-microsecond median, 100% recall, on the OpenAI 1536-dim cosine set. Tight p50-to-p99 gap (only 4 µs of tail). No GPU touched this benchmark.
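For reference, here's roughly how QPS, p50, and p99 fall out of per-query wall-clock timings. The `search` callable is a placeholder, and the real harness may drive concurrent clients, so treat this as the shape of the measurement rather than the exact methodology:

```python
import time
import numpy as np

def bench(search, queries, k: int = 10) -> dict:
    """Run each query once, record wall-clock latency, report QPS / p50 / p99."""
    latencies = []
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        search(q, k)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    lat = np.array(latencies)
    return {
        "qps": len(queries) / elapsed,
        "p50_us": np.percentile(lat, 50) * 1e6,
        "p99_us": np.percentile(lat, 99) * 1e6,
    }
```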
Pure-disk mode
| Metric | Result |
|---|---|
| Dataset | dbpedia-openai-100k |
| Recall@10 | 100% |
| QPS | 1,776 |
| p50 latency | 545 µs |
| p99 latency | 584 µs |
A bit slower — disk reads dominate when the OS isn't holding the index in page cache — but still sub-millisecond at p99 with full recall. The point of this mode isn't peak QPS; it's that you can run 100K vectors with a few hundred megabytes of RAM and still get exact answers.
dbpedia-openai-1m — what happens when you scale
mmap mode
| Metric | Result |
|---|---|
| Dataset | dbpedia-openai-1m |
| Recall@10 | 100% |
| QPS | 529 |
| p50 latency | 1.88 ms |
| p99 latency | 1.93 ms |
Ten times the data, latency goes up by roughly 9× — which is what you'd expect from honest on-disk math, not from "we kept everything in memory and called it a database." Still 100% recall. Still no GPU.
Pure-disk mode
| Metric | Result |
|---|---|
| Dataset | dbpedia-openai-1m |
| Recall@10 | 100% |
| QPS | 193 |
| p50 latency | 5.08 ms |
| p99 latency | 5.17 ms |
At 1M scale, pure-disk mode is doing real work — every query is fetching pages off NVMe, no cache assists. 5 ms p99, 100% recall, on 1.5 GB of disk. The tail is still tight: p99 is only 90 µs above p50, which tells you the IO path is doing what it's supposed to be doing.
What about 3,072 dimensions?
The numbers above are on 1536-dim vectors (the OpenAI ada-002 embedding family). The newer text-embedding-3-large model produces 3,072-dim vectors, and that's where most of the recent recall pain in the field lives — the ICLR 2026 stability paper goes into the why in detail.
We ran the same engine against dbpedia-openai-3-large-100k and wrote it up in a separate post — 100% recall on the full 3,072 dimensions, no Matryoshka truncation, sub-half-millisecond p50 in mmap mode.
Why this was hard
The textbook reason "100% recall is impossible on disk" is the brute-force argument: to find the true nearest neighbors you have to compare against every vector, and at 1M vectors × 1536 dimensions × 4 bytes that's ~6 GB of data to scan per query. So either you keep it all in RAM, or you give up exact answers and use approximate methods like HNSW. Both are sensible trade-offs.
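The arithmetic behind that figure, spelled out:

```python
N, DIM, BYTES = 1_000_000, 1536, 4           # vectors, dimensions, bytes per float32
scan_bytes = N * DIM * BYTES
print(f"{scan_bytes / 1e9:.1f} GB per exact query")   # ~6.1 GB
# Brute force is then one big matrix-vector product over all of it:
#   scores = corpus @ query; top10 = argpartition(-scores, 10)[:10]
```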
We tried a different path. The implementation details are our work to keep, but the outcome is what the tables above show: the same answer brute force would give you, in a fraction of the work, on a fraction of the RAM. We're not claiming this is the only way — just that for the workloads where correctness has to stay at 100%, it's now an option that doesn't cost a GPU.
What this opens up
A few things become a bit easier:
- Streaming inserts without a reindex. New vectors land at write time and are queryable immediately, no overnight rebuild. Approximate graphs often need to be reconstructed to stay accurate at scale; this approach doesn't (a toy sketch of the property follows this list).
- One server class instead of two tiers. No separate "vector tier" with GPUs and 256 GB of RAM. The disk-only mode runs on the same boxes as the rest of your stack.
- Confident answers in regulated workloads. Semantic search over medical records or legal documents is one of the places where "we found 92% of the matches" is genuinely painful. 100% recall isn't a luxury for those teams; it's most of the job.
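The first bullet is easiest to see with a toy: with exact scoring there is simply no index-build step whose staleness an insert could expose. The class below is a deliberately naive in-memory stand-in, not our engine:

```python
import numpy as np

class ToyExactIndex:
    """Illustration only: with exact scoring there is no graph to rebuild,
    so an inserted vector is visible to the very next query."""

    def __init__(self, dim: int):
        self.vectors = np.empty((0, dim), dtype=np.float32)

    def insert(self, vec: np.ndarray) -> int:
        self.vectors = np.vstack([self.vectors, vec.astype(np.float32)[None, :]])
        return len(self.vectors) - 1          # id is usable immediately

    def search(self, query: np.ndarray, k: int = 10) -> np.ndarray:
        scores = self.vectors @ query         # score everything: nothing stale, nothing approximate
        k = min(k, len(scores))
        return np.argsort(-scores)[:k]        # exact top-k ids
```

An on-disk engine obviously can't vstack its way through millions of inserts, but the property the bullet describes is the same: there's no build phase to invalidate.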
What this doesn't change
A few honest caveats:
- Pure-disk mode at 1M is not 4K QPS. 193 QPS is the number. If your traffic profile needs more than that at that scale, mmap mode (with the RAM to back it) is the right choice. Both modes hit 100% recall; the trade is throughput vs footprint, not accuracy.
- Cosine distance at 1536 dims above; 3,072 dims is covered separately. Other metrics (dot product, L2) work too. For much larger dimensions (8K+) we haven't published numbers yet.
- Single node. These are single-node numbers. Replication is a separate story.
What you can verify
The benchmarks are dbpedia-openai-100k and dbpedia-openai-1m. Both are public, vectors and query sets included. If our numbers don't reproduce when you run them, that's on us — please tell us. (The dbpedia-openai-3-large-100k results are in the 3,072-dim follow-up post.)
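If you want to sanity-check the recall claim independently, exact cosine ground truth for the 100K slice is cheap to compute yourself. A sketch, assuming you've exported the corpus and query embeddings to .npy files (the file names below are placeholders):

```python
import numpy as np

# Placeholder file names: export the corpus and query embeddings however you like.
corpus = np.load("dbpedia_openai_100k_vectors.npy").astype(np.float32)   # (100_000, 1536)
queries = np.load("dbpedia_openai_queries.npy").astype(np.float32)       # (n_queries, 1536)

# Cosine ground truth: normalize once, then exact top-10 by dot product.
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

# Batch the queries if RAM is tight; the score matrix is n_queries x 100_000.
ground_truth = np.argsort(-(queries @ corpus.T), axis=1)[:, :10]

# Compare ground_truth against the ids your engine returns; Recall@10 should come out 1.0.
```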
And if you're running vector search at any kind of scale and the recall–latency–footprint trade-off is biting you, I'd like to hear about it. Drop me a line — I'd be glad to compare notes against your current stack.
— Danny