Most humanoid robots you've watched fold a shirt or pour a glass of water this year are, more or less, stateless inference machines.
I'm not saying that to take a shot at the teams building them — the policies they're shipping are genuinely impressive. I just mean it literally: the production-grade humanoids running today take in the last few seconds of camera input, feed it to a vision-language-action model, compute the next action, and discard the rest. Power-cycle the robot and it doesn't know it cleaned the kitchen yesterday.
It's not that the field doesn't want long-horizon memory. The most-cited reference architecture for "robot that remembers," published in 2024 and held up as the state of the art, embeds video captions into a vector database — but that vector database runs as a separate server. So the moment the robot loses wifi, the memory layer goes with it. That's a reasonable design for a research demo, harder to live with in the field.
The same shape shows up in autonomous driving. A single vehicle generates 1–10 TB of sensor data per day across cameras, LIDAR, radar, IMU, and CAN — and the standard ROS recording layer caps at around 110 MB/s, which means under load, LIDAR sweeps get dropped, multi-camera setups lose frames, and the "we'll query this later" plan quietly becomes "we'll hope this was captured."
The industry has done remarkable work on the generative side — translating sensor data into actions. The persistence side — keeping the sensor data and the perceptions queryable on the device that produced them — has had less attention. That's the gap we've been trying to close.
What I built
A single embedded binary that runs three engines side by side on the robot itself:
- Vector — perception embeddings with the same on-disk 100% recall search I wrote about last week. The robot stores scenes and recalls similar ones in milliseconds, without phoning home.
- Document — calibration parameters, firmware state, prompt templates. Anything structured that has to survive a power cycle.
- Time-series — joint encoders, IMU, force/torque, tactile arrays, LIDAR sweeps. Ingested at machine rates into the same process as the memory layer.
All three run in one process. No network between them. No external vector server. The robot can run a full shift, log every sensor reading and every perception, and answer "have I seen this scene before?" in single-digit milliseconds.
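To make that concrete, here's a deliberately tiny, std-only Rust toy with the same shape: three kinds of state behind one in-process handle. None of this is the engine's actual API, which this post doesn't show; the struct names and the brute-force search are stand-ins for the design point that no query ever leaves the process.

```rust
use std::collections::HashMap;

/// One perception embedding plus the metadata needed for filtered recall.
struct Scene {
    embedding: Vec<f32>,
    camera_id: u32,
}

/// One timestamped sensor reading (IMU axis, joint encoder, ...).
struct Sample {
    timestamp_us: u64,
    value: f64,
}

/// All three "engines" behind one handle, in one process, no network between them.
/// The real engine persists to disk and indexes the vectors; this brute-force toy
/// only illustrates the shape of the design.
struct RobotStore {
    scenes: Vec<Scene>,                 // vector engine stand-in
    documents: HashMap<String, String>, // document engine stand-in (key -> JSON blob)
    samples: Vec<Sample>,               // time-series engine stand-in (append-only)
}

impl RobotStore {
    fn new() -> Self {
        Self { scenes: Vec::new(), documents: HashMap::new(), samples: Vec::new() }
    }

    /// "From camera N, similar to this scene?" -- exact (not approximate) search.
    fn most_similar_from(&self, query: &[f32], camera_id: u32) -> Option<(usize, f32)> {
        self.scenes
            .iter()
            .enumerate()
            .filter(|(_, s)| s.camera_id == camera_id)
            .map(|(i, s)| (i, cosine(query, &s.embedding)))
            .max_by(|a, b| a.1.total_cmp(&b.1))
    }
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

fn main() {
    let mut store = RobotStore::new();
    store.documents.insert("calibration/arm_left".into(), r#"{"offset_deg": 0.4}"#.into());
    store.samples.push(Sample { timestamp_us: 1_000, value: 0.02 });
    store.scenes.push(Scene { embedding: vec![0.1, 0.9, 0.3], camera_id: 3 });

    // One process answers all three kinds of question; nothing phones home.
    println!("calibration = {:?}", store.documents.get("calibration/arm_left"));
    println!("last sample = {:?}", store.samples.last().map(|s| (s.timestamp_us, s.value)));
    println!("nearest from camera 3 = {:?}", store.most_similar_from(&[0.1, 0.8, 0.35], 3));
}
```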
The numbers, end-to-end
These are from the engine's benchmark suite, run on the same class of hardware as an x86 edge box mounted on a robot. Different deployments will see different numbers, but the shape holds.
Hardware: AMD Ryzen 9 · 64 GB RAM · NVMe Gen 4 2 TB · no GPU.
| Workload | Latency / throughput | Recall / accuracy |
|---|---|---|
| Vector recall — "have I seen this scene before?" | 6.35 ms p50 | 100% Recall@10 |
| Filtered recall — "from camera 3, last mission, similar to this" | 7.31 ms p50 | 100% Recall@10 |
| Robot config / calibration write | 0.22 ms p50 | — |
| Time-series sensor ingest | 1.6M points/sec | — |
| Cross-engine join — sensor anomaly → matching memory | 6.61 ms p50 | exact |
| All three engines, 30 Hz frame budget | 0.36 ms p50 vs 33 ms budget | 91× headroom |
The row I care about most is the last one. A 30 Hz robotic perception cycle gives you 33 ms per frame. If your data layer eats 5 ms across three queries, perception has 28 ms left. If your data layer eats 0.36 ms across three concurrent queries — vector, document, sensor, in parallel — perception has 32.64 ms left. That's not a rounding error at frame rate.
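Here's that arithmetic from inside a 30 Hz loop, as a std-only Rust sketch. The three closures are placeholders for the real vector, document, and time-series queries; the measured latencies are the ones in the table above, not anything this toy produces.

```rust
use std::thread;
use std::time::{Duration, Instant};

const FRAME_BUDGET: Duration = Duration::from_micros(33_333); // 30 Hz perception cycle

fn main() {
    let frame_start = Instant::now();

    // Stand-ins for the three concurrent data-layer queries. In the real system
    // these would hit the vector, document, and time-series engines in the same
    // process; here they just return canned placeholder strings.
    let (scene_hit, calib, window) = thread::scope(|s| {
        let vector = s.spawn(|| "vector: nearest scene (placeholder)");
        let document = s.spawn(|| "document: gripper calibration (placeholder)");
        let series = s.spawn(|| "time-series: last 100 ms of IMU (placeholder)");
        (
            vector.join().unwrap(),
            document.join().unwrap(),
            series.join().unwrap(),
        )
    });

    let data_layer = frame_start.elapsed();
    let left_for_perception = FRAME_BUDGET.saturating_sub(data_layer);

    println!("{scene_hit} | {calib} | {window}");
    println!(
        "data layer took {:?}, perception keeps {:?} of the 33.3 ms frame",
        data_layer, left_for_perception
    );
}
```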
The single-digit ms vector recall is the second piece. Humanoid torque control loops run at 2 kHz and IMU sampling reaches up to 4 kHz; perception pipelines run at 10–30 Hz. Recall in 6.35 ms means the perception loop can ask "is this scene familiar?" inside its own frame budget, not as a deferred cloud round-trip.
And the recall has to actually be recall. A recent ICLR 2026 paper — Breaking the Curse of Dimensionality: On the Stability of Modern Vector Retrieval — formalizes why approximate methods lose accuracy as dimensionality climbs, and shows that the effect compounds in multi-vector and filtered settings, which is exactly the regime a robot's perception memory lives in. A perception layer that says "I saw this 87% of the time" isn't a perception layer; it's a coin flip with a polite face. The numbers above are 100% Recall@10 because that's what the use case needs.
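For clarity on the metric itself: Recall@10 is the fraction of the true 10 nearest neighbours that the engine actually returns. A minimal, self-contained sketch with made-up IDs:

```rust
use std::collections::HashSet;

/// Recall@k: fraction of the exact top-k result IDs that the engine returned.
fn recall_at_k(returned: &[u64], exact_top_k: &[u64]) -> f64 {
    let truth: HashSet<u64> = exact_top_k.iter().copied().collect();
    let hits = returned.iter().filter(|id| truth.contains(*id)).count();
    hits as f64 / exact_top_k.len() as f64
}

fn main() {
    // Made-up IDs: all 10 returned neighbours are in the exact top 10,
    // so Recall@10 = 1.0. Replace one with a miss and it falls to 0.9.
    let exact = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
    let returned = [3, 1, 2, 5, 4, 7, 6, 9, 8, 10];
    println!("Recall@10 = {}", recall_at_k(&returned, &exact));
}
```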
What's coming, honestly labeled
I'd rather call out what's not done than have you discover it later:
- 3D LIDAR spatial queries — "every point in this 3D bounding box from the last 2 seconds." Designed and being integrated for the next release: same engine, no separate spatial database. Benchmarks will be published when it ships. (The query predicate itself is simple geometry; see the sketch at the end of this section.)
- ARM64 (Jetson) deployment — designed, with the work in flight. The numbers above run on x86 edge hardware; the path to ARM is mapped, but it isn't shipping today.
If either of those gates a real deployment for you, I'd rather hear about it sooner than later.
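As promised above, the predicate behind the LIDAR bullet is ordinary geometry plus a time window; the hard part is indexing it, not stating it. A standalone Rust sketch of the predicate only, with illustrative field names and made-up points:

```rust
/// One LIDAR return: position in the robot frame plus a capture timestamp.
struct LidarPoint {
    x: f32,
    y: f32,
    z: f32,
    timestamp_us: u64,
}

/// Axis-aligned 3D bounding box.
struct Aabb {
    min: [f32; 3],
    max: [f32; 3],
}

/// "Every point in this 3D bounding box from the last 2 seconds."
/// The real engine would answer this from an index; a linear scan just states
/// the predicate each point has to satisfy.
fn points_in_box_since(points: &[LidarPoint], b: &Aabb, now_us: u64, window_us: u64) -> Vec<usize> {
    let cutoff = now_us.saturating_sub(window_us);
    points
        .iter()
        .enumerate()
        .filter(|(_, p)| {
            p.timestamp_us >= cutoff
                && p.x >= b.min[0] && p.x <= b.max[0]
                && p.y >= b.min[1] && p.y <= b.max[1]
                && p.z >= b.min[2] && p.z <= b.max[2]
        })
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let points = vec![
        LidarPoint { x: 0.5, y: 0.2, z: 1.0, timestamp_us: 9_500_000 },
        LidarPoint { x: 4.0, y: 0.2, z: 1.0, timestamp_us: 9_600_000 }, // outside the box
        LidarPoint { x: 0.4, y: 0.1, z: 0.9, timestamp_us: 1_000_000 }, // too old
    ];
    let b = Aabb { min: [0.0, 0.0, 0.0], max: [1.0, 1.0, 2.0] };
    let hits = points_in_box_since(&points, &b, 10_000_000, 2_000_000);
    println!("matching point indices: {hits:?}"); // -> [0]
}
```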
What this opens up
For humanoid teams: it's an option for running robot memory locally instead of against a vector server 30 ms away in a cloud region. The robot owns its memory; it survives a hard reboot; it works in a warehouse with patchy wifi.
For autonomous-driving stacks: it's a way to stop dropping LIDAR frames at the recording layer. Every sweep is captured, indexed, and queryable on the vehicle. "What did I see during that near-miss two minutes ago?" can be answered from the car rather than the cloud.
For inspection drones, surgical arms, warehouse AMRs: same shape, smaller footprint. One binary, no extra infrastructure, sensor history that survives every restart.
The interesting part, honestly, isn't really the database. It's that with the right substrate, robots can start to remember between sessions — which is something a lot of people watching this year's demos seemed to assume was already happening.
— Danny