Danny Yau · 6 min read

What GitDB is, and why I built it

AI coding agents spend 60–80% of their tokens just finding things in the codebase. That's grep hell. Here's what we built underneath instead.

Three honest numbers about where AI coding agents actually spend their tokens today:

  • 60–80% of an agent's tokens go to finding things, not answering your question. Multiple independent measurements converge on this range — Augment Code's research, the 70%-wasted-tokens write-up on Medium, Jake Nesler's context-compression analysis. The wedge isn't generation. It's retrieval.
  • A single grep "auth" against a real codebase returns 40+ hits and ~8,000 wasted tokens of context the model then has to mentally filter. Documented in the same measurements.
  • An agent like Claude Code routinely reads 25 files to answer a question about three functions — because filesystem-level retrieval has no way to know which three the question is really about. Same source.

I want to be careful about how I frame this: the models are not the problem. The substrate underneath them is.

Grep hell

Every modern AI coding agent — Claude Code, Cursor, Codex, the in-house ones — talks to your codebase the same way you used to from a terminal: grep, find, cat. That made sense when the actor was a person who could glance at the first three hits and decide "yeah, the one in auth/middleware.rs is the real authenticate."

For a model, the same primitives are very expensive in three specific ways:

1. Grep doesn't know what's a function and what's a comment. grep "authenticate" returns matches in every comment, every test, every error string, every README. The agent gets 40+ hits and pays the token bill to read enough surrounding context to figure out which match was the function it cared about. (The sketch after this list makes the contrast concrete.)

2. The filesystem has no concept of callers. The question "where is process_payment called from?" has no native answer. The agent issues several greps, reads the matching files, and then has to reason about which call sites are real call sites versus coincidental string matches. Cursor openly notes this gap — it operates on files but not on systems; a fix in an API handler doesn't propagate awareness to the frontend three directories away that depends on the response shape.

3. The file is the unit, not the function. When the agent wants one function, it has to read the whole 1,200-line file. Targeted-read APIs help (Claude Code's Read accepts offset and limit and is reportedly 10–50× cheaper for a 20-line read) — but only if the agent already knows the right line range, which it doesn't, until it has read the file once.
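
The structural alternative to problem 1 is easy to sketch. Below is a minimal version in Python, using the stdlib ast module on a Python tree; GitDB's own parser and the Rust examples above aren't shown here. The point is only textual-versus-structural: a query over the syntax tree can return real definitions and nothing else, never a comment, string, or README.

```python
# Minimal sketch of structural lookup, assuming a Python codebase and the
# stdlib ast module. Illustrative only; this is not GitDB's parser.
import ast
from pathlib import Path

def find_function(root: str, name: str) -> list[tuple[str, int, int]]:
    """Return (path, start_line, end_line) for actual definitions of `name`.
    Comments, docstrings, and error strings can never match."""
    hits = []
    for path in Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that don't parse
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) \
                    and node.name == name:
                hits.append((str(path), node.lineno, node.end_lineno))
    return hits

# grep "authenticate" returns 40+ textual hits; this returns only definitions.
print(find_function("src", "authenticate"))
```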

The cumulative effect on the bill is loud, but the cumulative effect on quality is the part I think doesn't get enough attention.

The 60% context cliff

A 200K-token model's working context window is not 200K. Output quality starts degrading at roughly 60% context utilization, and by around 130K the degradation is no longer gradual: it's a cliff. So when an agent burns 70K tokens on grep noise and full-file reads, it isn't only spending money it didn't need to spend. It's actively degrading the remaining work by filling the window with content the model then has to reason past. The dollar cost is the visible problem; the quality cost is the invisible one.
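
The arithmetic is worth one line of code. Both figures below are the ones from the paragraph above; nothing here is new data.

```python
# Budget arithmetic from the paragraph above.
window     = 200_000
soft_limit = int(0.60 * window)   # ~120K: where quality starts to slip
grep_burn  = 70_000               # grep noise + full-file reads

print(f"{grep_burn / soft_limit:.0%} of the usable window is spent "
      "before the actual task begins")   # ~58%
```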

The fix in the literature is the same shape every time: replace grep with structural queries. The most carefully reported case study takes a 200-file TypeScript project, swaps grep-based search for AST-level subgraph retrieval, and watches relevant file reads collapse from 40 to 5 — Claude Code's input tokens drop from 8,200 to 2,100 on the same task. Same model, same prompt, different substrate.

What we landed on

GitDB is what we ended up with when we tried to keep everything Git is good at, but let agents query the codebase the way they'd query any other data store.

The repository lives on a server. Agents and humans both reach it through a query API. They never clone. The interface isn't "give me file X"; it's:

  • find_function("authenticate") — returns the AST node plus the line range, regardless of which file or how many comments mention the word.
  • find_callers("process_payment") — returns every call site with file + line, with the string-match false positives already excluded.
  • read_lines("src/auth.rs", 142, 198) — returns those lines, not the file.
  • write_function(...) — edits the function in place, lets the server compute the diff.
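
Here is what a session against that surface looks like from the agent's side. The four method names are the ones above; the client class, the URL, the response fields, and the toy edit are hypothetical stand-ins, not a published SDK.

```python
# Illustrative session. find_function / find_callers / read_lines /
# write_function are the surface above; the client class, URL, and
# response field names are hypothetical.
from gitdb_client import GitDB  # hypothetical package name

db = GitDB("https://gitdb.example.com/acme/payments")

fn = db.find_function("authenticate")
print(fn.path, fn.start_line, fn.end_line)        # e.g. src/auth.rs 142 198

for site in db.find_callers("process_payment"):   # real call sites only
    print(f"{site.path}:{site.line}")

body = db.read_lines("src/auth.rs", 142, 198)     # the function, not the file

db.write_function("src/auth.rs", "authenticate",  # server computes the diff
                  body.replace("validate_legacy", "validate_v2"))
```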

The point of the API surface is to give the agent a function-level vocabulary instead of a file-level vocabulary. Three things change as a result:

  1. Targeted retrieval becomes the default. An agent that wants one function gets that function (~470 tokens), not the file it lives in (~9,000 tokens). The 25-file read pattern collapses to single-digit reads.
  2. Cross-file relationships are first-class. "Who calls this?" is one tool call returning a structured answer, not a multi-step grep loop.
  3. Every read is logged. Not "the user accessed a file" — which lines, by which identity, at which timestamp. It's a side effect of the design that turns out to matter the first time an agent does something surprising.
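
For concreteness, one entry in that log might reduce to a record like the following. The schema is an illustrative guess, not GitDB's actual log format.

```python
# One logged read, as an illustrative record shape (not GitDB's real schema).
audit_record = {
    "tool": "read_lines",
    "path": "src/auth.rs",
    "lines": [142, 198],                           # which lines
    "identity": "agent:claude-code/session-7f3a",  # which identity (hypothetical format)
    "timestamp": "2025-11-02T14:31:07Z",           # which timestamp
}
```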

Nothing about Git itself goes away. Every commit is still a commit, every diff is still a diff, every pull request still pulls and requests. We just stopped putting a filesystem between the agents and the work.

Two stacks of savings, not one

The token reduction we measure isn't one wedge. It's two, stacked.

~95% on input tokens, from function-level retrieval. The straightforward one. When an agent calls find_function("authenticate") + read_lines(...) instead of grepping and dumping the surrounding file, the same answer arrives in ~470 tokens instead of ~9,000. That points the same way as the case study cited earlier (8,200 to 2,100 input tokens, a ~74% reduction); on our own internal benchmark suite the number sits closer to 95% per call, because the lookup also eliminates the grep round-trips that come before the read.

~83% on memory tokens, from recall over re-discovery. The less obvious one. Without persistent memory, every agent run re-derives the same conclusions — "what did we decide about rate limiting last week?" gets rebuilt from scratch every time. GitDB stores the agent's prior reasoning as recallable memory, so a subsequent run pulls a ~200-token recovered context instead of burning 2,000 tokens re-thinking. SAGE (2025, MSR) measured 59% on a related workload; stacked with the AST retrieval above, we see closer to 83% on coding tasks where the agent's prior context matters.
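
The post doesn't publish the memory API, so the shape below is a guess: recall() and remember() are hypothetical names on the client sketched earlier. The point is only the control flow, a cheap recall first and the expensive re-derivation only on a miss.

```python
# Hypothetical shape of the memory wedge; recall()/remember() are
# illustrative names, not a published GitDB API.
def rederive_from_scratch() -> str:
    """Stand-in for the agent re-reading and re-reasoning from scratch."""
    ...

note = db.recall("rate limiting decision")        # ~200-token recovered context
if note is None:
    note = rederive_from_scratch()                # the ~2,000-token re-think
    db.remember("rate limiting decision", note)   # cheap on every later run
```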

The two wedges hit different token buckets, retrieval and memory, so they compound rather than overlap. The same feature that used to burn ~1.4M tokens on grep hell runs at ~140K with both wedges on: same model, same prompt, same answer, roughly an order of magnitude less.
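
A back-of-envelope composition, as a sketch: the per-call figures are the ones above, while the workload mix and the fixed-overhead bucket are made-up assumptions chosen so the totals land near the post's numbers.

```python
# How the two wedges compose on one feature's worth of work. Per-call
# figures are from the post; the mix and FIXED bucket are assumptions.
GREP_READ, AST_READ = 9_000, 470   # tokens per retrieval, before / after
RETHINK,   RECALL   = 2_000, 200   # tokens per remembered decision
FIXED = 80_000                     # prompts, outputs, tool overhead (untouched)

reads, recalls = 140, 30
before = reads * GREP_READ + recalls * RETHINK + FIXED  # 1,400,000
after  = reads * AST_READ  + recalls * RECALL  + FIXED  #   151,800

print(f"{before:,} -> {after:,} tokens ({1 - after / before:.0%} saved)")
```

The exact split between the buckets moves the final number, but under any plausible mix the two wedges together land near an order of magnitude.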

A full walkthrough of the token math, with the comparison script, is in the cost write-up.

If your token bill on Claude Code or Cursor or your in-house equivalent feels heavier than the actual work being done, that's the gap we built this for. Reach out and I'll show you the numbers honestly.

— Danny