Why We're Building Spectral

A memory system for AI agents — and a wager that honesty is a competitive advantage.

Most AI agents have amnesia. They are brilliant in the moment and blank the instant the context window fills. The industry's answer has been to bolt on a memory layer: embed everything, store the vectors, retrieve the nearest neighbors, stuff them back into the prompt. It works, sort of, and it has produced a crowded field of systems that all look roughly alike and all advertise roughly the same thing — a big number on a benchmark called LongMemEval.

We are building Spectral, and we made a different bet. Two of them, actually. The first is architectural. The second is about how you earn trust, and it turns out to matter more.

The architectural bet: no model in the recall path

Spectral retrieves memories without calling a language model to do it. Recall runs on local full-text search and a graph of relationships between memories — deterministic, inspectable, fast. There is no embedding service to host, no vector database to operate, and critically, no LLM sitting in the hot path of every retrieval deciding what's relevant.
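To make the shape of that recall path concrete, here is a minimal Python sketch of deterministic retrieval over a keyword index plus a relationship graph. It is an illustration under stated assumptions, not Spectral's implementation: the MemoryStore class, its method names, the whitespace tokenizer, and the 0.5 weight for linked memories are all hypothetical. What it does show is the property described above: every result carries the reasons it came back.

```python
from collections import defaultdict


class MemoryStore:
    """Toy recall path: a keyword index plus a relationship graph.

    Hypothetical illustration only; no model or embedding is involved.
    """

    def __init__(self):
        self.texts = {}                   # memory id -> text
        self.index = defaultdict(set)     # token -> ids of memories containing it
        self.links = defaultdict(set)     # memory id -> ids of related memories

    def add(self, mem_id, text, related=()):
        self.texts[mem_id] = text
        for token in text.lower().split():
            self.index[token].add(mem_id)
        for other in related:             # relationships are stored in both directions
            self.links[mem_id].add(other)
            self.links[other].add(mem_id)

    def recall(self, query, limit=5):
        scores = defaultdict(float)
        reasons = defaultdict(list)
        # Step 1: full-text match, one point per distinct query token found.
        for token in sorted(set(query.lower().split())):
            for mem_id in self.index.get(token, ()):
                scores[mem_id] += 1.0
                reasons[mem_id].append(f"matched '{token}'")
        # Step 2: one hop of graph expansion from the direct hits.
        direct = set(scores)
        for mem_id in sorted(direct):
            for neighbor in sorted(self.links.get(mem_id, ())):
                if neighbor not in direct:
                    scores[neighbor] += 0.5   # assumed weight for a linked memory
                    reasons[neighbor].append(f"linked to {mem_id}")
        # Sort by score, then id, so identical inputs always give identical output.
        ranked = sorted(scores, key=lambda m: (-scores[m], m))[:limit]
        return [(m, scores[m], reasons[m]) for m in ranked]


store = MemoryStore()
store.add("m1", "alice moved to berlin in march")
store.add("m2", "alice works at a bakery", related=["m1"])
store.add("m3", "bob likes chess")

results = store.recall("berlin")
for mem_id, score, why in results:
    print(mem_id, score, why)
```

In this toy, m1 comes back because it contains the query term and m2 comes back only because it is linked to m1. Both reasons are recorded alongside the score, which is the kind of accountability a model call in the recall path would not give you.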

This is an unfashionable choice. The fashionable choice is to throw a model at every stage, because models are flexible and flexibility hides a lot of design sins. But a model in the recall path is a tax you pay on every single query — in latency, in cost, and in a kind of unaccountability, because you can no longer say precisely why a given memory came back.

The measured consequences of taking the other path, with no model in recall:

Retrieval latency: 17 milliseconds at the median. CPU-local, no network round-trip to an embedding service, no vector index to traverse.
Memory-layer overhead: 169 tokens per query. This is the cost of the memory machinery itself — the work Spectral does to find what's relevant, separate from the answer-generation that every system pays for. It is roughly half a percent of total system cost. The recall path's own model-token cost is structurally zero, because there is no model in it.
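The two figures in that item pin down a third. As a rough check, and assuming cost is proportional to token count (the text does not say how cost is computed), 169 tokens at about half a percent implies a total of roughly 34,000 tokens per query. The sketch below does that arithmetic. The memory_layer_share helper is hypothetical, and the implied total inherits the rounding in "roughly half a percent", so it is an estimate and not a published figure.

```python
def memory_layer_share(memory_tokens, total_tokens):
    """Fraction of a query's tokens spent by the memory layer itself."""
    return memory_tokens / total_tokens


MEMORY_TOKENS = 169       # memory-layer overhead per query, from the text
STATED_SHARE = 0.005      # "roughly half a percent", so this is approximate

# Total per-query tokens implied by the two stated figures (an estimate).
implied_total = round(MEMORY_TOKENS / STATED_SHARE)
share = memory_layer_share(MEMORY_TOKENS, implied_total)

print(implied_total)
print(f"{share:.2%}")
```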

We are, as far as we can tell, the only system that publishes that memory-layer-overhead number in isolation. Others report a single blended figure, or don't report cost at all. That isn't an accident of disclosure — it's an architectural difference. When your recall path is deterministic, you can measure exactly what it costs. When it's a model call, the cost blurs into everything else.

What the accuracy number actually is

On LongMemEval, the standard benchmark for long-term memory in agents, Spectral scores 81.5% across all six question categories on the full 500-question set, with query expansion enabled.

We want to be precise about that number, because precision is the whole point. It is measured, not projected. The denominator is clean. We pre-registered our expected range before the run and it landed exactly inside it. When we found a bug in our evaluation harness that was scoring three correct answers as wrong, we fixed it — and then we audited whether the same bug had ever scored a wrong answer as right, because a bug that only ever helps you is a suspicious bug. It hadn't. The 81.5% is honest in both directions.
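The two-directional audit described above can be sketched as a comparison between the harness verdict and an independent review of the same answers. This is a hypothetical illustration: the Graded record, the audit function, and the four toy rows are invented for the example and are not Spectral's harness or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graded:
    question_id: str
    harness_correct: bool    # verdict from the evaluation harness under audit
    reviewed_correct: bool   # verdict from an independent manual review


def audit(records):
    """Split harness mistakes by the direction in which they move the score."""
    hurt = [r.question_id for r in records
            if r.reviewed_correct and not r.harness_correct]    # right, scored wrong
    helped = [r.question_id for r in records
              if r.harness_correct and not r.reviewed_correct]  # wrong, scored right
    return {"hurt_score": hurt, "helped_score": helped}


# Toy rows, invented for illustration.
toy = [
    Graded("q1", harness_correct=True, reviewed_correct=True),
    Graded("q2", harness_correct=False, reviewed_correct=True),
    Graded("q3", harness_correct=False, reviewed_correct=False),
    Graded("q4", harness_correct=True, reviewed_correct=True),
]

report = audit(toy)
print(report)
```

An empty helped_score list is the condition the paragraph above calls honest in both directions: the bug only ever lowered the score, never raised it.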

Accuracy per question category:

Single-session-assistant — 90.9%
Knowledge-update — 87.2%
Single-session-user — 85.7%
Temporal-reasoning — 82.7%
Multi-session — 74.0%
Single-session-preference — 52.0%
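One quick consistency check a reader can run on these figures: the unweighted mean of the six category accuracies is 78.75%, which is lower than the 81.5% headline. That is what you would expect if the headline is computed per question across the 500-question set and the categories differ in size. Per-category question counts are not given here, so treat that reading as an inference. A minimal sketch:

```python
from statistics import mean

# Per-category accuracy in percent, copied from the list above.
category_accuracy = {
    "single-session-assistant": 90.9,
    "knowledge-update": 87.2,
    "single-session-user": 85.7,
    "temporal-reasoning": 82.7,
    "multi-session": 74.0,
    "single-session-preference": 52.0,
}

# Unweighted (macro) average: every category counts equally,
# regardless of how many questions it contains.
macro_average = mean(list(category_accuracy.values()))
weakest = min(category_accuracy, key=category_accuracy.get)

print(f"{macro_average:.2f}")
print(weakest)
```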

We could tell you 81.5% is state-of-the-art. It isn't. There are published numbers in the mid-90s. The obvious way to close that gap is to swap in a more powerful model to read the retrieved context. We measured exactly that, and it recovered nothing, because the failures that remain are not retrieval failures. They are synthesis failures: the evidence is in the context, and the reasoning over it (counting, date arithmetic, multi-step inference) is where it breaks. We classified all 94 of our remaining failures by hand to know this rather than guess it.

So we are not going to chase the headline number by doing the things that inflate headline numbers: a heavier reader the product doesn't actually use, a vector-hybrid rebuild that would make Spectral indistinguishable from everything else, or tuning so hard on these 500 questions that the score stops meaning anything on the 501st. Each of those buys a better number and a worse system. We have declined all of them, and we have the measurements that show what each would and wouldn't have bought.

The real bet: honesty as a moat

Here is the thing we actually believe, the reason this is a studio project and not a weekend hack.

The memory-systems field has a trust problem. The numbers are big and the methodologies are thin. Benchmarks get gamed, denominators get massaged, and the gap between "what we published" and "what you'd get" is wide enough that serious evaluators have started publicly auditing each other's claims. In a market like that, a trustworthy number beats an impressive one — because the impressive one is a liability the moment someone checks it.

So we built Spectral the way you'd build something you intend to be checked. Every benchmark claim has a clean denominator and a published methodology. Every cost figure separates the memory layer from the answer-generation every system shares. Every failure is classified to its mechanism, so our limitations section is a measured artifact, not a paragraph of hedging. We pre-register what we expect before we run, so we can't retrofit the story to the result. And when something we hoped would work didn't, we wrote down that it didn't, with the numbers, so we'd never waste the effort re-attempting it.

This is slower. It cost us, more than once, the satisfying version of a result. But it produces something the fast version can't: a system whose every claim survives contact with a skeptic, because it was built by skeptics.

Where this goes

The benchmark measures retrieval over a frozen corpus: ingest a history, ask a set of unrelated questions, score each in isolation. The capability we care most about is one that shape of test cannot see by construction — a recall-to-recognition feedback loop, where what an agent successfully retrieves and uses should reinforce what surfaces next, so memory gets sharper with use rather than just larger.

We are building that loop, and we are holding ourselves to the same standard we hold the accuracy number to. Right now the mechanism is wired but does not yet move retrieval the way we intend: we built a controlled test for it, the test said the effect is currently too small to change what comes back, and it told us exactly why. So we are not going to claim it works until it measurably does. That's the whole point of building this way: the studio thesis isn't that our capabilities are finished, it's that our claims about them are true and checked. When the loop produces a measurable lift, we'll show you the measurement. Until then it's a direction, stated as one.
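To illustrate what "too small to change what comes back" can mean, here is a hypothetical sketch of a usage-based boost and a controlled check on it. Nothing here is Spectral's mechanism: the additive boost, the rerank and loop_changes_top_k names, the scores, and the weights are all assumptions chosen for the example.

```python
def rerank(base_scores, use_counts, weight):
    """Rank memory ids by base relevance plus an assumed usage boost."""
    boosted = {m: s + weight * use_counts.get(m, 0)
               for m, s in base_scores.items()}
    # Ties are broken by id so the ordering is deterministic.
    return sorted(boosted, key=lambda m: (-boosted[m], m))


def loop_changes_top_k(base_scores, use_counts, weight, k):
    """Controlled check: does the boost alter the top-k results at all?"""
    without_loop = rerank(base_scores, {}, weight)[:k]
    with_loop = rerank(base_scores, use_counts, weight)[:k]
    return without_loop != with_loop


# Toy values, invented for illustration.
base = {"a": 3.0, "b": 2.0, "c": 1.0}   # relevance before any reinforcement
used = {"c": 4}                          # memory "c" was retrieved and used 4 times

small_effect = loop_changes_top_k(base, used, weight=0.1, k=2)
large_effect = loop_changes_top_k(base, used, weight=0.5, k=2)
print(small_effect, large_effect)
```

With the smaller weight the boost is real but never reorders the top two results, so a user would see no difference. With the larger weight, memory "c" displaces "b". A check of this shape separates "the mechanism runs" from "the mechanism changes what comes back".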

That's the bet in one line. The crowded part of this field is competing on a saturating benchmark with numbers nobody fully trusts. The open ground is an architecture that's cheaper, faster, inspectable, and honest — and a capability the standard test can't even measure, that we intend to earn the right to claim. We'd rather build there.

We're building Spectral because we think the next generation of agents will be defined less by how much they can hold in a context window and more by what they reliably remember, recognize, and bring back when it matters. And we think the teams that win that will be the ones whose claims you can check.

Spectral is the memory engine behind Permagent, our local-first AI agent OS, where these ideas get put to work.

Spectral is open source under the Apache 2.0 license. You can check ours.
