Danny Yau

What 100% recall could mean for legal AI (and other high-stakes memory)

Legal AI tools hallucinate 17–33% of the time per a peer-reviewed Stanford study. A lot of that comes from retrieval, not generation. Here's the memory-layer math, and what we benchmarked.

The single most valuable application of AI in 2026 might be law. Harvey just closed a round at an $11B valuation on $190M ARR, with more than 100,000 lawyers across 1,300 organizations on the platform and 80 of the Am Law 100 firms paying for it.

And the same teams will tell you, openly, that the hardest unsolved problem in the category is hallucination. Harvey's leadership calls it "the single most important technical challenge in legal AI". A peer-reviewed Stanford study published in the Journal of Empirical Legal Studies measured the two largest legal-research AIs in production and found them hallucinating between 17% and 33% of the time — including, in one documented case, answering a post-Dobbs abortion question using the undue-burden standard that Dobbs had explicitly overruled.

This post is about the part of that problem that lives in the memory layer.

Why "hallucination" is partly a retrieval problem

When an AI tool answers a legal question incorrectly, the failure can come from two places: the generation step (the model invented something) or the retrieval step (the right document was in the corpus but the retrieval system didn't surface it, so the model had to invent).

The Stanford team made this explicit. Their analysis of where the failures originated kept landing on the second category. "Retrieval is particularly challenging in law," they wrote, because legal queries often span temporal and jurisdictional variation (overruled precedents, circuit splits). And a vector index that misses 5–15% of relevant documents per query is, in practice, an index that lets the model confidently generate from an incomplete picture.

This is where the memory layer architecture starts to matter. Most production memory services for AI are built on one of two patterns; both are well-engineered, both are good fits for their original workload, and both have specific properties that get harder to live with under the legal-grade bar.

Pattern 1 — Smart extraction

You hand the system a conversation or a document. An LLM reads through it, pulls out facts — parties, dates, clauses, decisions, names — and stores those facts in a structured store. On the way back in, a retrieval pass finds the right facts and threads them into the prompt.
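To make that shape concrete, here's a minimal sketch of the write and recall paths (the names and the Fact shape are illustrative, not any particular vendor's API):

```python
# Minimal sketch of the extraction pattern. extract_facts() stands in for
# the LLM pass that decides, at write time, what counts as a fact.
from dataclasses import dataclass

@dataclass
class Fact:
    kind: str   # e.g. "party", "date", "clause", "decision"
    text: str   # the extracted snippet, not the full document

fact_store: list[Fact] = []

def extract_facts(document: str) -> list[Fact]:
    # Placeholder: a real system calls an LLM here and gets back
    # structured facts. Anything the prompt doesn't anticipate is dropped.
    return []

def remember(document: str) -> None:
    # Only the extractor's output becomes searchable memory; recall later
    # runs over these summaries, not over the original text.
    fact_store.extend(extract_facts(document))

def recall(query: str) -> list[Fact]:
    return [f for f in fact_store if query.lower() in f.text.lower()]
```

Everything hinges on what extract_facts() keeps; the failure mode described below falls straight out of that.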

Teams doing this well are publishing real numbers — 91–94% on LongMemEval and LoCoMo, retrieval calls under ~7,000 tokens. For consumer chatbots — "the assistant should remember that the user prefers dark mode" — this is genuinely the right shape.

The legal-grade problem with extraction: the extraction LLM decides at write time what counts as a fact. Eight months later, when a partner asks the niche question the extractor didn't know to anticipate, the original text is still on disk somewhere, but the searchable memory is the summary — and the summary doesn't contain the thing the partner now needs.

Pattern 2 — Knowledge graph + ontology

The system parses content into entities and relationships, builds a graph, resolves contradictions automatically. Good implementations hit sub-300 ms p50 and outperform pure vector approaches on the conversational benchmarks.
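In miniature, with a schema invented purely for illustration (no particular vendor's ontology), the pattern looks like this:

```python
# Toy sketch of the graph pattern with a fixed, service-defined ontology.
# The schema and case names below are invented for illustration.
graph: set[tuple[str, str, str]] = set()   # (subject, relation, object)

def ingest(triples: list[tuple[str, str, str]]) -> None:
    # A real implementation extracts these with an LLM and resolves
    # contradictions (e.g., two conflicting dates) at write time.
    graph.update(triples)

def neighbors(entity: str, relation: str) -> list[str]:
    return [o for (s, r, o) in graph if s == entity and r == relation]

ingest([
    ("Smith v. Jones", "decided_by", "Ninth Circuit"),
    ("Smith v. Jones", "cites", "Dobbs"),
])
print(neighbors("Smith v. Jones", "cites"))   # ['Dobbs']
```

The pushback shows up the moment a document needs a relation the schema doesn't have.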

The legal-grade problem with ontologies: the ontology is defined by the service. The corpus a Harvey-style application has to work over is not a clean person/product/event schema — it's contracts, opinions, motions, briefs, statutes, regulations, redlines. Every law firm's notion of what matters is different. The moment your domain stops looking like the ontology, the abstraction starts pushing back.

I want to be clear about something: both patterns are legitimate engineering, and the teams behind them have done real work. If your workload is conversational personalization, either is probably the right answer. I'm not arguing against them; I'm arguing they're not the right fit for the "every word counts and you can't summarize" use case.

A different choice — UQL with a 100% recall floor

The memory layer we built doesn't extract facts. It stores what you give it, indexed in a way that lets you compose recall queries with whatever structure you want, with the guarantee that what you stored is what comes back. The query language, UQL, has six composable levels.

The recall floor matters because of the Stanford finding above. If your retrieval layer can't reach 100% recall, the model is being asked to reason from a sometimes-incomplete view of your corpus, and "sometimes" is exactly where hallucinations enter the system.
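Back-of-envelope, with illustrative numbers rather than figures from the study: if the index finds each relevant document with probability r, a question that needs n documents gets complete context with probability r^n, assuming independent misses.

```python
# Illustrative numbers, not figures from the Stanford study.
# Assumes each relevant document is missed independently.
r = 0.92   # per-document recall of the index
for n in (1, 3, 8):
    print(f"{n} relevant docs -> complete context {r**n:.0%} of the time")
# 1 relevant docs -> complete context 92% of the time
# 3 relevant docs -> complete context 78% of the time
# 8 relevant docs -> complete context 51% of the time
```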

We ran an end-to-end benchmark across all six UQL phases — vector search, filtered search, join, analytics, cross-check, pipeline. Here are the numbers, on dbpedia-openai with cosine distance, full dimensions, no truncation.

Hardware: AMD Ryzen 9 · 64 GB RAM · NVMe Gen 4 2 TB · no GPU.

| Phase | What it tests | Recall@10 | p50 | p99 |
| --- | --- | --- | --- | --- |
| L1 — INTUITION | Vector search | 100% | 4.70 ms | 4.93 ms |
| L2 — EPISODIC | Filtered vector search | 100% | 5.51 ms | 6.02 ms |
| L3 — RBAC STATE | SQL → Vector join | 100% | 7.73 ms | 7.96 ms |
| L4 — ANALYTICS | GROUP BY / SUM / AVG on results | 100% | 5.46 ms | 5.87 ms |
| L5 — CROSS-CHECK | Correctness across modes | 100% | — | — |
| L6 — PIPELINE | Multi-step pipeline plans | 100% | 0.24 ms | 8.69 ms |

Every phase landed at 100% recall@1, @10, @100, and @1000 — meaning the result set is identical to a brute-force scan, not just "close enough." The L5 row doesn't measure latency by design (it's a correctness pass); the others are real p50 / p99.
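For anyone who wants the scoring spelled out, here's the standard way a claim like that gets checked (a generic sketch of the methodology, not our actual harness):

```python
import numpy as np

def recall_at_k(returned_ids: np.ndarray, query: np.ndarray,
                corpus: np.ndarray, k: int) -> float:
    # Ground truth: exact top-k by cosine similarity over the full corpus,
    # i.e. a brute-force scan with no index involved.
    sims = (corpus @ query) / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
    exact_top_k = set(np.argsort(-sims)[:k].tolist())
    # 100% recall@k means the index's top-k equals the exact set, every query.
    return len(exact_top_k & set(returned_ids[:k].tolist())) / k
```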

The post on 100% recall on disk and the 3072-dim follow-up cover where the recall floor comes from at the vector layer. UQL is what sits on top — the contract that lets a legal-AI application (or anything else) compose recall the way it actually needs to.

What this changes for legal-grade workloads

The Stanford queries that broke the production tools were almost all of one shape: "find me the document or precedent that matches this question, given a constraint on jurisdiction or date or category." That's not a pure vector lookup; it's a vector lookup constrained by structured filters.

Three concrete patterns that come up:

The clause-level question. "Find every contract from 2022–2024 where the indemnification cap was tied to a percentage of fees rather than a fixed dollar amount." On extraction-based memory, this answer depends on whether the extractor classified that pattern as a fact at write time. On UQL, it's an L2 query: semantic search for the clause pattern, filtered by date_range. The full text is preserved and the search runs over what was actually written, not what was summarized.
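Roughly, the call shape could look like this (the client handle and parameter names are invented for illustration, not UQL's documented surface):

```python
# Illustrative call shape only; "memory" and these parameters are
# hypothetical, not UQL's actual API.
def clause_level_question(memory):
    # L2: semantic search for the clause pattern, constrained by metadata.
    return memory.search(
        text="indemnification cap tied to a percentage of fees",
        filter={"doc_type": "contract",
                "date_range": ("2022-01-01", "2024-12-31")},
        k=100,
    )
```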

The composed query. "Show me opinions from Ninth Circuit, after 2020, that cite Dobbs and discuss the undue-burden standard." That's three structured predicates (circuit = "9th", year > 2020, cites_dobbs = true) plus a semantic match ("undue-burden standard"). On most memory services, this is three or four API calls and an application-side intersection. On UQL it's a single L3 SQL → Vector join — one round-trip, 100% recall over the eligible set.
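Sketched with the same invented client, the whole question collapses into one call:

```python
# Illustrative call shape only; names are hypothetical, not UQL's actual API.
def composed_query(memory):
    # L3: structured predicates joined with a semantic match in one
    # round-trip, instead of several calls and an app-side intersection.
    return memory.search(
        text="undue-burden standard",
        where="circuit = '9th' AND year > 2020 AND cites_dobbs = true",
        k=100,
    )
```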

The aggregate over a corpus. "Group every motion in the firm's history by judge, and surface the win rate where the motion cited a specific holding." That's L4 analytics over a vector-retrieved subset, all in one call. The lawyer asks the question once and gets a count, not a 47-step Python notebook.
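And the aggregate, again with invented names:

```python
# Illustrative call shape only; names are hypothetical, not UQL's actual API.
def motion_win_rates(memory, holding: str):
    # L4: GROUP BY and an aggregate over the vector-retrieved subset,
    # all in one call.
    return memory.search(
        text=holding,
        filter={"doc_type": "motion"},
        group_by="judge",
        aggregate={"win_rate": "AVG(won)"},
    )
```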

Honest caveats

A few, because the topic deserves them:

  • We are not Harvey, and we are not trying to be. Harvey is a legal-AI application — the end product the lawyer uses. We're an infrastructure layer that an application like Harvey (or one of its competitors, or an in-house legal team building their own) could choose to run on. This post is about what becomes possible at the layer below the application, not about replacing Harvey.
  • The benchmarks above are on dbpedia-openai cosine. That's a standard public benchmark with real query sets. For a legal corpus, the numbers would need to be re-measured on that corpus. We're confident in the engine's behavior; we don't want to overstate cross-domain transfer.
  • We don't auto-extract facts at this layer. If your application needs that (most legal-AI applications do), you'd run an LLM extraction step at write time and store the structured output in our engine. The 100% recall floor is on what you store, not a substitute for the structure.
  • Maturity. We're in private beta, with a smaller customer base and fewer integrations than the established memory services. That's real, and it's worth weighing against everything above.

If you're building in this space

If you're building legal AI, or any application where "the model can't be allowed to guess what's in the corpus" is the rule, and the retrieval layer is part of the cost calculus you're working through — please reach out. I'd genuinely like to compare notes on what you're using today, where it's working, and where it isn't.

— Danny