Everyone's launching wrappers. Nobody's going deep.

Last summer I sat down to build memory for AI agents and assumed, like everyone else, that the hard part would be the reasoning layer. Pick the right model, tune the prompts, maybe add a knowledge graph if you're feeling ambitious. The retrieval side seemed solved. Embed your text, store the vectors, run similarity search, done. Every tutorial said so. Every product on the market was built that way. I had no reason to question it until I started measuring.

What I found over the next three months broke my assumptions about how AI products are actually built, and more importantly, how they're not built. The gap between what companies claim their systems do and what those systems actually do when you test them is enormous. Not in a hand-wavy "could be better" sense. In a measurable, documented, reproducible sense that I can point to with specific numbers because I ran roughly 26,000 benchmark evaluations to get them.

What those benchmarks actually show

The first thing I did was run a standard benchmark against my own retrieval pipeline. Ground-truth questions, known correct answers, straightforward evaluation. The results were bad enough that I thought I'd made an implementation error. So I checked the code, ran it again, got the same numbers, and started digging into the individual failures.

I categorized 357 failed retrievals by hand. This is the kind of work nobody wants to do because it's slow and tedious and there's no shortcut. You read each failure, figure out what the system retrieved instead of what it should have retrieved, and classify why. Some failures were temporal: the system couldn't distinguish between something the user said last week and something they said six months ago. Some were entity-level: it confused which person said what in a multi-party conversation. Some were compositional: the answer required combining information from two different memories and the system only surfaced one. After about two weeks of this I had a finding that reframed the entire problem: 92% of the failures were retrieval failures, not reasoning failures. The information existed in the database. The system had it. It just couldn't find it when asked.
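The retrieval-versus-reasoning split can be approximated mechanically once each question is labeled with the memories that contain its answer. The sketch below is only an illustration of that check, not the hand classification described above; the `Failure` record layout and its field names are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Failure:
    """One failed benchmark question (hypothetical record layout)."""
    question_id: str
    gold_ids: set[str]        # memories that contain the answer
    retrieved_ids: list[str]  # what the pipeline actually returned


def attribute(failure: Failure) -> str:
    """Label a failure by where the pipeline lost the answer."""
    if failure.gold_ids <= set(failure.retrieved_ids):
        # All the evidence reached the model, so the model mishandled it.
        return "reasoning"
    # At least one needed memory never reached the model.
    return "retrieval"


def attribution_shares(failures: list[Failure]) -> dict[str, float]:
    """Fraction of failures per label; empty input gives an empty dict."""
    counts = Counter(attribute(f) for f in failures)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Toy data: two questions lost evidence in retrieval, one did not.
failures = [
    Failure("q1", {"m1"}, ["m7", "m9"]),
    Failure("q2", {"m2", "m3"}, ["m2", "m8"]),  # compositional: one half missing
    Failure("q3", {"m4"}, ["m4", "m5"]),
]
print(attribution_shares(failures))
```

A check like this only separates the two broad buckets. The finer labels (temporal, entity-level, compositional) still need someone to read the failures.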

This distinction matters more than it sounds like it should. The entire AI product space is having a heated debate about which LLM to use, which graph database gives better relationship mapping, whether you need RAG or long context or fine-tuning. Almost nobody is talking about whether the retrieval layer underneath all of that actually works. They assume it does because it returns results and the results look plausible. But "looks plausible" and "is correct" are very different things, and the only way to know which one you're getting is to measure.

I confirmed this with an oracle test. I bypassed retrieval entirely, fed the model the full conversation as context, and watched accuracy jump to 93.8%. The data was always there. The search just couldn't surface it. The librarian was broken, not the library.
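In code, an oracle test comes down to running the same questions twice and changing only where the context comes from. The following is a minimal sketch under stated assumptions: the `answer` and `retrieve` callables stand in for a real model and a real retriever, the question dictionaries use made-up keys, and exact string match is a simplification of whatever grading a real benchmark uses.

```python
from typing import Callable

Answerer = Callable[[str, list[str]], str]  # (question, context) -> answer
Retriever = Callable[[str], list[str]]      # question -> retrieved memories


def accuracy(questions, answer: Answerer, context_for) -> float:
    """Share of questions answered correctly under a given context provider."""
    if not questions:
        raise ValueError("need at least one question")
    correct = sum(
        answer(q["question"], context_for(q)) == q["gold"] for q in questions
    )
    return correct / len(questions)


def oracle_gap(questions, full_history, answer: Answerer, retrieve: Retriever):
    """Return (accuracy with retrieved context, accuracy with full context)."""
    retrieved = accuracy(questions, answer, lambda q: retrieve(q["question"]))
    oracle = accuracy(questions, answer, lambda q: full_history)
    return retrieved, oracle


# Toy stand-ins so the sketch runs end to end.
history = ["Ana moved to Lisbon", "Ben adopted a cat", "Ana works at a bakery"]
questions = [
    {"question": "Lisbon", "gold": "Ana moved to Lisbon"},
    {"question": "cat", "gold": "Ben adopted a cat"},
]


def toy_answer(question, context):
    # "Model": return the first context line mentioning the keyword.
    return next((line for line in context if question in line), "")


def toy_retrieve(question):
    # Deliberately broken retriever: always returns only the first memory.
    return history[:1]


print(oracle_gap(questions, history, toy_answer, toy_retrieve))  # (0.5, 1.0)
```

A large gap between the two numbers points at the retriever, because the model and the data are identical in both runs.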

This is a strange thing to discover because it means the entire field was optimizing the wrong layer. If you spend six months improving your reasoning pipeline and your retrieval is only surfacing 60% of the relevant information, you're making a better engine for a car with flat tires. The car still won't go anywhere useful.

The 56-combo matrix

Once I knew retrieval was the bottleneck, I needed to understand how much the choice of embedding model and reranker actually mattered. So I built what turned out to be a fairly large test rig: 7 embedding models crossed with 8 rerankers, 56 combinations total, each evaluated against 1,540 ground-truth questions. That's roughly 26,000 individual benchmark evaluations when you account for the different retrieval depths and parameter sweeps.
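The shape of such a test rig is a plain cross product. Here is a hedged sketch: the model names are placeholders (the actual seven embedders and eight rerankers are not listed here), and `evaluate` is a hypothetical callable that would run the ground-truth questions for one pair and return an accuracy.

```python
from itertools import product

# Placeholder names only; substitute the real model identifiers.
EMBEDDERS = [f"embedder_{i}" for i in range(7)]
RERANKERS = [f"reranker_{j}" for j in range(8)]


def run_grid(evaluate, embedders=EMBEDDERS, rerankers=RERANKERS):
    """Evaluate every (embedder, reranker) pair; return rows sorted best-first."""
    results = [
        {"embedder": e, "reranker": r, "accuracy": evaluate(e, r)}
        for e, r in product(embedders, rerankers)
    ]
    return sorted(results, key=lambda row: row["accuracy"], reverse=True)


def spread(results):
    """Gap between best and worst accuracy, in whatever units evaluate() uses."""
    scores = [row["accuracy"] for row in results]
    return max(scores) - min(scores)


def fake_evaluate(embedder, reranker):
    # Deterministic stand-in so the sketch runs; a real version would
    # score the pair against the ground-truth question set.
    return 90.0 + (int(embedder[-1]) + int(reranker[-1])) % 3


grid = run_grid(fake_evaluate)
print(len(grid), grid[0], spread(grid))
```

Keeping every row, not just the winner, is what makes the spread and the ranking available afterwards.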

Nobody had done this comparison before, at least not publicly. The reason is straightforward: it's boring. There's no clever trick that lets you skip the work. You configure each combination, run the evaluation, wait for it to finish, record the results, move to the next one. It took weeks. I was ordering DoorDash at 2am because I forgot to eat again, watching numbers scroll by on a terminal while everyone on Twitter posted screenshots of apps they built in an afternoon.

The total spread across all 56 combinations was 3.2 percentage points, from 89.9% to 93.1%. That sounds tight until you realize the implications. First, most products ship without testing even a single combination. They grab whatever embedding model the quickstart guide uses and never question it. Second, the difference between the best and worst combination was enough to flip the user experience from "this mostly works" to "this misses important things regularly." Third, and this surprised me more than anything else in the entire project, the cost of the model barely correlated with performance. A $0.40-per-million-token model with 100 retrieved memories beat a $15-per-million-token model with 15 retrieved memories. The cheap model with better retrieval recovered 82% of errors. The expensive model with worse retrieval recovered 54%.

That's not a marginal finding. It's a complete inversion of the default assumption that better models produce better results. Retrieval quality dominated model quality so thoroughly that optimizing your search pipeline was worth more than upgrading to a model that costs roughly 37 times as much. Every dollar spent on a better model was a dollar that would have been better spent on a better embedding or reranker configuration. And almost nobody in the industry was thinking about it this way.

I also found a configuration bug in my own code during this process. A script was silently loading MiniLM instead of the GTE ModernBERT reranker I thought I was running. Nobody noticed because nobody was measuring. No error, no warning, just quietly degraded accuracy that looked normal because there was no baseline to compare against. This is the kind of thing that's sitting in production systems everywhere: silent misconfigurations that degrade performance without any visible signal. The only way to catch them is to actually benchmark your retrieval against ground truth. If you never measure, you never know your system is running the wrong model.
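Two cheap guards catch this class of bug: fail loudly when the loaded model is not the configured one, and compare each run against a recorded baseline. This is a sketch with hypothetical names throughout. The model identifiers, the `ModelMismatch` exception, and the 0.5-point tolerance are all illustrative assumptions, not values from the benchmark.

```python
class ModelMismatch(RuntimeError):
    """Raised when a loaded model differs from the configured one."""


def assert_models(loaded: dict[str, str], expected: dict[str, str]) -> None:
    """Compare loaded model names to the config, role by role."""
    mismatches = {
        role: (expected[role], loaded.get(role))
        for role in expected
        if loaded.get(role) != expected[role]
    }
    if mismatches:
        details = ", ".join(
            f"{role}: expected {want!r}, got {got!r}"
            for role, (want, got) in sorted(mismatches.items())
        )
        raise ModelMismatch(details)


def within_baseline(accuracy: float, baseline: float, tolerance: float = 0.5) -> bool:
    """True if accuracy has not dropped more than `tolerance` points."""
    return accuracy >= baseline - tolerance


# Hypothetical identifiers; a real check would read them from the loaded objects.
expected = {"embedder": "configured-embedder", "reranker": "configured-reranker"}
loaded = {"embedder": "configured-embedder", "reranker": "some-fallback-reranker"}

try:
    assert_models(loaded, expected)
except ModelMismatch as err:
    print("refusing to run:", err)
```

The name check catches the wrong model before any evaluation runs. The baseline check catches the cases a name check cannot, such as the right model loaded with the wrong weights or settings.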

What wrapper companies skip

The wrapper pattern in AI products follows a predictable path. Take an API from Anthropic or OpenAI, build a UI on top, add some prompt engineering, deploy. The product works in the demo because demos are curated. It works for the first few users because early adopters are forgiving and they fill in the gaps themselves. It starts failing silently at scale because nobody built the infrastructure to detect failures, and the failures are invisible because they don't produce errors. The system returns something. It just returns the wrong thing.

The specific thing that gets skipped is the measurement layer. Not logging, not observability in the DevOps sense, but actual evaluation of whether the core intelligence pipeline produces correct outputs. This is partly because measurement is hard and partly because the results are scary. Once you start measuring, you find out how often your system is wrong, and that number is usually higher than anyone wants to see. There's a strong incentive to never look. If you don't measure, every retrieval looks successful. The results come back, they seem relevant, the user doesn't complain loudly enough. Ship it.

When I started building what became TrueMemory, I made three architectural decisions that came directly from the benchmark data. SQLite instead of a dedicated vector database, because the constraint forced a better hybrid search pipeline and eliminated the ability to blame infrastructure for retrieval failures. When your memory system is one file, there's no "the cluster might be having latency issues" excuse. If retrieval is broken, the architecture is broken.
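To make the "one file" idea concrete, here is a minimal sketch of storing embeddings in SQLite and searching them by brute-force cosine similarity. It is not TrueMemory's schema or search code: the table layout, function names, and the linear scan are assumptions for illustration, and a hybrid pipeline would add a full-text index alongside it.

```python
import math
import sqlite3
from array import array


def open_store(path=":memory:"):
    """Open (or create) the single-file store; ':memory:' keeps the demo on RAM."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS memories ("
        "id INTEGER PRIMARY KEY, text TEXT NOT NULL, embedding BLOB NOT NULL)"
    )
    return db


def add_memory(db, text, vector):
    """Store the text with its vector packed as 32-bit floats."""
    blob = array("f", vector).tobytes()
    db.execute("INSERT INTO memories (text, embedding) VALUES (?, ?)", (text, blob))
    db.commit()


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(db, query_vector, k=5):
    """Linear scan over all rows; returns the top-k (score, text) pairs."""
    scored = []
    for text, blob in db.execute("SELECT text, embedding FROM memories"):
        vec = array("f")
        vec.frombytes(blob)
        scored.append((cosine(query_vector, vec), text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


db = open_store()
add_memory(db, "likes hiking", [1.0, 0.0])
add_memory(db, "allergic to peanuts", [0.0, 1.0])
print(search(db, [0.9, 0.1], k=1))
```

With everything in one table, a failed lookup can be reproduced with a single query against a single file.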

A neuroscience-inspired encoding gate that filters what gets stored based on novelty, salience, and prediction error, because the data showed that less noise in storage produced less noise in retrieval. I'd been reading papers on how the hippocampus and amygdala interact during memory formation. Biological memory doesn't record everything. It runs incoming experience through a filter that evaluates whether something is new, whether it matters, and whether it violates expectations. Most of what you experience gets discarded. That's not a limitation; it's the core mechanism that makes retrieval work. I modeled the encoding gate on this process: three signals combined into a weighted sum with a threshold that determines whether a memory gets stored or dropped. The benchmarks supported this approach.
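A weighted sum with a threshold is small enough to show in full. The sketch below assumes each signal is already scored in the range 0 to 1. The weights and threshold are illustrative placeholders, not the tuned values, and how each signal gets computed is outside the scope of the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateConfig:
    # Illustrative values only; the real weights and threshold are not given here.
    w_novelty: float = 0.4
    w_salience: float = 0.4
    w_prediction_error: float = 0.2
    threshold: float = 0.5


def gate_score(novelty, salience, prediction_error, cfg=GateConfig()):
    """Weighted sum of the three signals, each assumed to lie in [0, 1]."""
    return (
        cfg.w_novelty * novelty
        + cfg.w_salience * salience
        + cfg.w_prediction_error * prediction_error
    )


def should_store(novelty, salience, prediction_error, cfg=GateConfig()):
    """Store the memory only if its score reaches the threshold."""
    return gate_score(novelty, salience, prediction_error, cfg) >= cfg.threshold


# New, important, mildly surprising: kept.
print(should_store(0.9, 0.8, 0.5))
# Familiar, unimportant, expected: dropped.
print(should_store(0.1, 0.1, 0.1))
```

Because the gate is a single scalar comparison, the threshold becomes one more parameter that can be swept against the benchmark like any other.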

And a 6-layer retrieval pipeline combining sparse full-text search, dense vector search, reciprocal rank fusion, and cross-encoder reranking, all tuned across those 56 combinations. None of these decisions are the kind of thing you arrive at by moving fast. They came from sitting with bad results for weeks, reading neuroscience papers, and running thousands of evaluations to validate each change. The whole system runs on a Raspberry Pi for $12 a month and scores within 3 points of systems requiring GPU infrastructure that costs $150 to $400 a month.
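Of those stages, reciprocal rank fusion is the easiest to show in isolation. It merges ranked lists using positions only, so sparse and dense scores never need to share a scale. This is a generic sketch of the standard formula, not the tuned pipeline: `k=60` is the commonly used default constant, the memory IDs are made up, and the cross-encoder reranking that would follow is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to an item's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


sparse_hits = ["m3", "m1", "m7"]  # e.g. from full-text search
dense_hits = ["m1", "m4", "m3"]   # e.g. from vector search
print(reciprocal_rank_fusion([sparse_hits, dense_hits]))
# ['m1', 'm3', 'm4', 'm7']
```

Items that appear in both lists rise to the top even when neither search ranked them first, which is the behavior a hybrid pipeline wants before the more expensive reranking step.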

The platform convergence problem

There's a structural issue with the wrapper model that goes beyond individual product quality. Anthropic is shipping native memory for Claude. OpenAI is building memory into ChatGPT. Google's Gemini remembers conversations. Every major platform is converging on memory as a built-in feature.

When the platform ships a native version of your product, the wrapper dies. Not because the platform version is better, but because it's already installed, already integrated, and already free. The platform doesn't need to build a good version. It needs to build a good enough version. Meeting summarizers learned this lesson last year when Zoom, Google Meet, and Microsoft Teams all shipped native summarization within months of each other. Those wrapper companies didn't fail because their product was bad. They failed because the platform they depended on decided to build the same thing, and the platform had the distribution to make a mediocre version beat a great one.

The only defense against platform absorption is depth. Not UI depth, not onboarding depth, but technical depth the platform can't replicate by adding a feature checkbox. Published research documenting why the standard approach breaks. A retrieval pipeline tuned across 56 embedding and reranker combinations. An encoding gate modeled on biological memory formation. An arXiv paper with methodology, controlled benchmarks, and reproducible results.

The wrapper companies are building on rented land. The landlord is already building the same features. The only products that survive are the ones that went deep enough to own something the platform can't just add.

Josh Adler is a researcher at TrueMemory, a Sauron company. Research: arXiv:2605.04897. More at joshadler.com.