Every few months, another headline. An engineer downloads thousands of files in the weeks before they quit. They turn up at a competitor or start a company of their own. The old employer files suit. The case takes years.
I don't think the engineers are usually villains, for what it's worth. The system invites this — modern software development gives every employee a full local copy of the codebase by design, and most data-loss-prevention tools weren't built for source. The result is that whether the departure was clean or not, the answer to "do they still have the code?" is almost always "yeah, probably."
This post is about that gap, the public cases that keep illustrating it, and what we've tried to do about it in GitDB.
## The pattern, in public
A few recent cases from court filings and DOJ releases — every one of them ended in litigation, criminal charges, or both.
- **Google/Waymo** (filed 2017, sentenced 2020). A former Google engineer downloaded 9.7 GB of files, roughly 14,000 documents, before leaving the self-driving group. The civil case settled for ~$245M in equity; the criminal case ended in an 18-month prison sentence.
- **Intel** (filed July 2024). An ex-employee allegedly downloaded about 18,000 confidential files in the days leading up to his termination, some marked "Intel Top Secret."
- **X Corp** (filed 2024). A former engineering manager allegedly extracted approximately 6 million lines of internal source code and later used it to seed a competing venture.
- **Google AI** (convicted Jan 2026). A former Google software engineer was found guilty of stealing more than 2,000 pages of confidential AI trade secrets, in part by circumventing DLP: pasting source into Apple Notes, exporting it as PDF, and uploading it from the corporate network.
These are the ones that became public. The 2025 Ponemon report puts the average annual cost of insider threats at $17.4M, and IP theft is implicated in 39% of insider incidents. Most don't end up in court; they end up as a quiet line item in a security write-up.
## Why the old playbook misses
The shared shape across every one of these cases is that the actor had legitimate access. They were employees doing what their access let them do — they cloned, they downloaded, they exported. The post-hoc question is always "how do we prove what they took?", and that question is hard to answer when the substrate is a filesystem.
Two specific gaps:
1. **`git clone` is bulk by definition.** The moment a clone completes, the entire repository is on a disk you don't control. Revoking the engineer's SSO afterwards is closing a door the file already walked through. There's no granular access record, because the action was "give me everything" and the system said yes.
2. **There's no per-file visibility after the clone.** Once the repo lives on the laptop, opening `secrets/prod.yaml` looks exactly like opening `README.md`. Endpoint DLP can sometimes catch the upload to external storage, but it can rarely answer "which files inside the repo did this person actually read?"
So when HR signs the exit paperwork, the security team is left reconstructing intent from network logs, VPN sessions, and commit history — and the honest answer to "do they still have the code?" is "we can't fully verify."
## What GitDB does differently
We didn't build GitDB to solve insider theft specifically — we built it because AI coding agents made the cost of full-file reads painful. But the same design decisions that cut the token bill also change the answer to the post-departure question. Three properties matter here.
**No bulk code access. Ever. By anyone.** There is no `git clone`, no ZIP export, no full-repo download path inside GitDB, not for humans and not for agents. Every single read is an individual, ACL-checked operation on a specific file or line range. Engineers work through the VS Code extension or the web reviewer; agents work through scoped MCP tools. The codebase doesn't live on the laptop because there's no path for it to get there in the first place.
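To make "every read is an individual operation" concrete, here is a minimal sketch of what a range-scoped, ACL-checked read can look like from the client's side. The host, endpoint, parameters, and error behavior are hypothetical illustrations, not GitDB's actual API:

```python
import os

import requests

BASE_URL = "https://gitdb.example.com"  # hypothetical host, for illustration only


def read_range(repo: str, path: str, start: int, end: int) -> str:
    """Fetch one line range of one file. Note what is absent: there is no
    'give me the whole repo' call, so the widest request this shape even
    allows is a single file."""
    resp = requests.get(
        f"{BASE_URL}/v1/repos/{repo}/file",
        params={"path": path, "start_line": start, "end_line": end},
        headers={"Authorization": f"Bearer {os.environ['GITDB_TOKEN']}"},
        timeout=10,
    )
    resp.raise_for_status()  # a 403 here means the ACL said no to this specific file
    return resp.text


# One call, one auditable event: (actor, repo, path, lines 120-180, timestamp).
snippet = read_range("acme/monorepo", "billing/invoice.py", 120, 180)
```

The point of the sketch is the request shape: there is no parameter you can widen into "give me everything."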
**Every read lands in an audit trail.** Each file read is one row: actor, timestamp, file, line range. Not "the user accessed the repo" but "this identity read this function at this nanosecond." When you need to answer "which files did this person actually read this quarter?", the answer is a single query.
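Assuming those rows land in a relational store, the quarter question really is one query. A sketch with SQLite and an invented `file_reads` schema (actor, ts, repo, path, start_line, end_line); the table and column names are illustrative, not GitDB's real schema:

```python
import sqlite3

conn = sqlite3.connect("audit.db")  # stand-in for wherever the audit rows live

# "Which files did this person actually read since the start of the quarter?"
rows = conn.execute(
    """
    SELECT path,
           COUNT(*) AS reads,
           MIN(ts)  AS first_read,
           MAX(ts)  AS last_read
    FROM file_reads
    WHERE actor = ? AND ts >= ?
    GROUP BY path
    ORDER BY reads DESC
    """,
    ("jdoe@example.com", "2026-01-01"),
).fetchall()

for path, reads, first, last in rows:
    print(f"{path}: {reads} reads, {first} .. {last}")
```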
**Bulk-access patterns become anomalies, not noise.** Because no normal workflow needs hundreds of files in sixty seconds, a burst of that shape is a real signal. The system can throttle the session and surface it to whoever's on call. This is the part of the architecture that turns "we can't verify" into "we already paused that session and here's the file list."
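Here's a toy version of the burst rule, to show the shape of the check rather than our production logic; the window and threshold are invented numbers you would tune per team:

```python
import time
from collections import deque

WINDOW_SECONDS = 60
MAX_DISTINCT_FILES = 100  # invented threshold; no normal workflow comes close


class BurstGuard:
    """Flags sessions that read an abnormal number of distinct files per window."""

    def __init__(self) -> None:
        self._events: dict[str, deque] = {}

    def allow(self, actor: str, path: str, now: float | None = None) -> bool:
        """Record one file read; return False if the session should be
        throttled and surfaced to whoever is on call."""
        now = time.monotonic() if now is None else now
        window = self._events.setdefault(actor, deque())
        window.append((now, path))
        # Drop reads that have aged out of the sliding window.
        while window and now - window[0][0] > WINDOW_SECONDS:
            window.popleft()
        distinct = {p for _, p in window}
        return len(distinct) <= MAX_DISTINCT_FILES
```

Because every read already flows through a single chokepoint, a check like this can sit inline and pause the session before the next read, rather than surfacing in next week's log review.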
## What the exit interview can sound like
The line in our design doc I keep coming back to is this contrast:
| Question | Cloned-repo world | GitDB world |
|---|---|---|
| "Do they still have the code?" | "We can't verify — they had a local clone." | "No — code was never on their device. Token revoked." |
That's the entire pitch. It's not that determined attackers can't ever find a way; it's that the bulk-export path — the one that produced every public case above — isn't there. A person can still take screenshots of one function on the web reviewer. They can still memorize an algorithm. We're not solving every edge of insider risk. We're closing the door on the loud, repeatable pattern that ends up in court.
## What this doesn't solve
A few honest caveats, because the topic deserves them:
- **Screenshots and human memory.** If someone reads a function and writes a similar one elsewhere, GitDB doesn't stop that. No technical control does.
- **Determined exfiltration over time.** A user with read access can, in principle, copy small amounts of code by hand over months. We make it costly and visible, since every read is logged (a sketch of what that visibility looks like follows this list), but we don't make it impossible.
- **Legal recourse.** If a case ends up in court, GitDB's audit trail is evidence. It isn't a substitute for the legal work, but the evidence is precise instead of reconstructed.
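Here's that sketch: the kind of low-and-slow coverage query the per-read log enables, reusing the invented `file_reads` schema from above. Hand-copying never trips a burst rule, but cumulative coverage per actor still stands out:

```python
import sqlite3

conn = sqlite3.connect("audit.db")  # same hypothetical store as earlier

# Actors ranked by how much of the codebase they touched in the last 90 days.
rows = conn.execute(
    """
    SELECT actor,
           COUNT(DISTINCT path) AS files_touched,
           COUNT(*)             AS total_reads
    FROM file_reads
    WHERE ts >= DATE('now', '-90 days')
    GROUP BY actor
    ORDER BY files_touched DESC
    LIMIT 20
    """
).fetchall()

for actor, files_touched, total_reads in rows:
    print(f"{actor}: {files_touched} distinct files, {total_reads} reads")
```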
If you're a security or engineering leader thinking about the gap between "our SSO revocation is good" and "we know what they actually read in their last two weeks," I'd be glad to compare notes. We're in private beta and trying to learn from teams that have been on the receiving end of this problem in real life.
— Danny