A short comparison post, since the question comes up: "Where does your vector engine sit next to Pinecone, Qdrant, and Milvus?"
I'm going to keep this functional rather than evaluative — what each system does, what we do, and where the trade-offs differ. Pick what fits your workload.
What Pinecone does
Pinecone is a managed service that abstracts the indexing layer entirely — users don't pick an index type or tune parameters. Queries hit a managed endpoint and return results.
Why recall plateaus below 100% at scale: like every production system built on approximate nearest-neighbor (HNSW-style) graph indexes, the algorithm explores a neighborhood around the query rather than scoring every vector in the corpus. That's the property that makes the index fast, and it's also the property that makes 100% recall structurally impossible without a full brute-force pass. At 100M vectors, the managed service has to choose a graph-exploration budget that fits inside a latency target; wherever that cutoff lands, the missed 1–5% is by design, not a tuning gap. There's no parameter the user can flip to recover those results, because the parameter isn't exposed.
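To make that coupling concrete, here is a minimal sketch using the open-source hnswlib library on synthetic vectors (not Pinecone's internals, which aren't exposed): recall@10 is measured against a brute-force pass, and the only thing that changes is the exploration budget `ef`.

```python
import numpy as np
import hnswlib

dim, n, n_queries, k = 128, 50_000, 200, 10
rng = np.random.default_rng(0)
corpus = rng.standard_normal((n, dim)).astype(np.float32)
queries = rng.standard_normal((n_queries, dim)).astype(np.float32)

# Brute-force ground truth: score every vector for every query (cosine via normalized dot products).
c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
truth = np.argsort(-q @ c.T, axis=1)[:, :k]

# HNSW index: fast because each query explores only a neighborhood of the graph.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(corpus)

for ef in (16, 64, 256):  # the exploration budget a managed service has to fix for you
    index.set_ef(ef)
    labels, _ = index.knn_query(queries, k=k)
    recall = np.mean([
        len(set(map(int, labels[i])) & set(map(int, truth[i]))) / k
        for i in range(n_queries)
    ])
    print(f"ef={ef:<4d} recall@10={recall:.3f}")  # recall climbs with ef, and so does latency
```

Raising `ef` recovers recall here only because this toy index exposes the knob; the point above is that a fully managed service picks that number for you.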
What Qdrant does
Qdrant exposes retrieval as composable primitives: indexing, scoring, filtering, and ranking are each user-controllable. The implementation is Rust, with SIMD on the hot paths.
The supported indexing approach is HNSW with tunable parameters. Recall is whatever the user dials it to.
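A minimal sketch of where those knobs live in the qdrant-client Python package; the collection name, dimension, and parameter values are illustrative, not recommendations.

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumes a local Qdrant instance

# Index-time knobs on the collection: graph fanout (m) and build-time beam width (ef_construct).
client.create_collection(
    collection_name="docs",  # illustrative name
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
    hnsw_config=models.HnswConfigDiff(m=16, ef_construct=200),
)

# Query-time knob: hnsw_ef is the per-search exploration budget, i.e. the user picks
# the spot on the recall/latency curve. Payload filters compose with the vector search.
hits = client.search(
    collection_name="docs",
    query_vector=[0.0] * 1536,  # placeholder vector
    limit=10,
    search_params=models.SearchParams(hnsw_ef=256),
    query_filter=models.Filter(
        must=[models.FieldCondition(key="lang", match=models.MatchValue(value="en"))]
    ),
)
```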
What Milvus does
Milvus offers 11+ index types — IVF_FLAT, HNSW, DiskANN, SCANN, and more. Different parts of a corpus can use different indexes.
The system is designed to scale to hundreds of millions to billions of vectors. Operational complexity scales with that, and the user chooses the recall-vs-speed trade per index type.
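Roughly what that per-index choice looks like through the pymilvus MilvusClient; the collection and field names are illustrative, and swapping `index_type` (to HNSW, DiskANN, and so on) is the whole point.

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumes a local Milvus instance

# Index-time choice: IVF_FLAT here, but the same field could use HNSW, DiskANN, SCANN, ...
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="embedding",          # illustrative vector field name
    index_type="IVF_FLAT",
    metric_type="COSINE",
    params={"nlist": 1024},          # number of IVF clusters, chosen by the user
)
client.create_index(collection_name="docs", index_params=index_params)

# Query-time recall/speed knob for IVF: nprobe = how many of the nlist clusters to scan.
hits = client.search(
    collection_name="docs",
    data=[[0.0] * 1536],             # placeholder query vector
    limit=10,
    search_params={"metric_type": "COSINE", "params": {"nprobe": 32}},
)
```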
The shared property underneath
The dominant index type across all three of these in 2026 is still HNSW (or HNSW + an alternative for cold data). HNSW has a structural property: it achieves recall above 0.99 at sub-millisecond latencies, but the working set has to fit in RAM. Once the index spills past available memory, latency climbs (mmap paging) or recall drops (switch to a different index).
IVF is the more memory-efficient alternative, but it's slower on small datasets and requires periodic re-clustering as the corpus grows — streaming inserts aren't fully searchable until the next rebuild.
In all three of the production systems above, 100% recall costs you either a GPU brute-force path or a workload small enough to fit entirely in RAM.
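For a sense of scale, here is a rough back-of-envelope for what "fits entirely in RAM" means with an in-memory HNSW index; the constants (float32 vectors, roughly 2·M graph links per node, roughly 10% overhead) are assumptions for illustration, not measurements of any of the systems above.

```python
def hnsw_ram_estimate_gb(n_vectors: int, dim: int, m: int = 16) -> float:
    """Rough RAM estimate for an in-memory HNSW index (illustrative constants).

    Assumptions: float32 vectors (4 bytes/dim), ~2*m neighbor links per node
    with 4-byte ids on the base layer, ~10% extra for upper layers and bookkeeping.
    """
    vector_bytes = n_vectors * dim * 4
    link_bytes = n_vectors * 2 * m * 4
    return (vector_bytes + link_bytes) * 1.10 / 1e9

# 100M vectors at 1536 dims: the raw vectors dominate, and all of it has to stay resident.
print(f"{hnsw_ram_estimate_gb(100_000_000, 1536):.0f} GB")  # ≈ 690 GB
```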
What we do
We took a different starting assumption: keep the recall floor at 100% and make the index work from disk.
- Recall floor at 100%. The engine returns the exact set brute-force would return. On `dbpedia-openai-100k` (1536-dim cosine) we measure 100% Recall@10 at 216 µs p50, and on `dbpedia-openai-3-large-100k` (3072-dim cosine) 100% Recall@10 at 407 µs p50. Numbers and methodology are in the recall and 3072-dim posts; a generic sketch of the measurement follows this list.
- Native on-disk, no mmap dependency. The index is served from disk, not from an in-memory representation that the OS page cache happens to be holding. Behavior at 100K and at 10M is the same shape.
- Streaming inserts without rebuild. New vectors are queryable at write time, and the recall floor doesn't move while writes are landing.
- UQL composability beyond pure vector search. A single call can fuse a SQL filter, a vector search, and an analytic step — covered in the legal-AI post.
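For readers who want to run the same kind of check against their own stack, this is the shape of the measurement. It's a generic sketch, not our benchmark harness: `search_fn` stands in for whatever engine is under test, and the ground truth is an exact brute-force pass.

```python
import time
import numpy as np

def evaluate(search_fn, corpus: np.ndarray, queries: np.ndarray, k: int = 10):
    """Recall@k against brute-force ground truth, plus p50 query latency in microseconds.

    `search_fn(query, k)` should return the ids of the engine's top-k results.
    """
    # Exact top-k by cosine similarity over the full corpus (the set brute force would return).
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    truth = np.argsort(-q @ c.T, axis=1)[:, :k]

    recalls, latencies = [], []
    for i, query in enumerate(queries):
        start = time.perf_counter()
        ids = search_fn(query, k)
        latencies.append(time.perf_counter() - start)
        recalls.append(len(set(map(int, ids)) & set(map(int, truth[i]))) / k)

    return float(np.mean(recalls)), float(np.percentile(latencies, 50) * 1e6)
```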
How the trade-offs compare
| Capability | Pinecone | Qdrant | Milvus | Ours |
|---|---|---|---|---|
| Recall floor | tuned by service | user-tunable | per index type | 100% always |
| Index lives in | RAM (managed) | RAM (HNSW) | RAM or disk per index | Disk |
| User-tunable parameters | No | Yes | Yes | Few — recall is fixed at 100% |
| Streaming inserts at full recall | n/a | partial (HNSW) | depends on index | Yes |
| SQL + vector in one call | No | No (filters only) | No (filters only) | Yes (UQL L3) |
| Analytics over results in one call | No | No | No | Yes (UQL L4) |
| Operational model | Managed only | Self-host or cloud | Self-host or cloud | Managed, BYOC, or appliance |
The empty cells aren't gaps in the others; they're design choices. Pinecone trades parameter control for operational simplicity. Qdrant and Milvus expose the parameters so you can pick your spot on the recall/RAM curve. We don't expose the recall parameter because we don't ship the approximate path.
Where each system tends to fit
These are functional descriptions, not endorsements:
- Pinecone fits when the team prefers a managed surface with no parameter tuning, and the workload tolerates the index choices the service makes on the team's behalf.
- Qdrant fits when the team wants to tune parameters and operate the system, and the workload is well-served by a tunable HNSW.
- Milvus fits when the corpus is large enough (hundreds of millions to billions) that different parts benefit from different index types, and the team has bandwidth to operate the resulting complexity.
- Ours fits when the workload is bounded (millions to low tens of millions), the recall floor has to be 100%, and the query is often more than a single vector lookup — filtered, joined, or aggregated in the same call.
Honest caveats
- Single-node benchmarks. The numbers above are single-node. Distributed scaling works in our internal deployments; we haven't published a multi-node benchmark.
- `dbpedia-openai` is a public benchmark, not your corpus. For a specific dataset the numbers would need to be re-run on that data.
- Maturity. Private beta. Smaller customer base, fewer integrations than the established options.
If your workload is in the bounded-corpus, every-result-matters category and you'd like to compare numbers against your current stack, please reach out. I'd be glad to walk through the comparison honestly.
— Danny