FishMem vs mem0 OSS: An Honest, Reproducible Memory Benchmark

We benchmarked FishMem against mem0 OSS on a frozen, shared harness — 233 held-out questions, one LLM judge. Here are the numbers, including the category where mem0 wins, and how to rerun every one yourself.

More retrieval is not the same as better recall

Most memory benchmarks reward one thing: pulling more text into the context window. That looks good on a leaderboard and falls apart in production, where every token is latency and cost, and where the real question is not whether the right fact appeared somewhere in 25,000 tokens, but whether the agent answered correctly from a few hundred. So when we benchmarked FishMem, we measured accuracy and cost together, on held-out questions, with a frozen configuration — and we are publishing the result that does not flatter us alongside the ones that do.

TL;DR

Overall recall: FishMem 62.2% vs mem0 OSS 54.5%, LLM-judge accuracy.
Temporal questions: 69.8% vs 28.3% — a 41.5-point gap, the largest.
Multi-hop questions: 54.8% vs 57.1% — mem0 OSS wins this one, and we are publishing it anyway.
Search latency p50: ~420ms vs 2.4 to 3.8s for a leading hosted alternative, measured from the same client — roughly 6 to 9 times faster on every turn of the agent loop.
Everything below is reproducible: 233 held-out questions, a frozen config, the same harness and the same LLM judge for both systems.

Why this matters

Memory lives inside the agent loop. Every turn pays the retrieval cost, so latency compounds and token bloat lands directly on your inference bill. A memory layer that wins a benchmark by stuffing more into context is quietly making your production agent slower and more expensive. The honest question is whether the layer returns the right fact, quickly, without dragging the whole history along — and whether you can verify the claim yourself.

How we measured

We ran FishMem and mem0 OSS (TypeScript) on the same workload, under conditions designed to be checkable:

233 held-out questions never used for tuning.
Frozen configuration — no per-benchmark tweaks on either side.
One shared harness and one LLM judge, identical for both systems, so the scoring is apples to apples.
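The protocol above can be sketched as a small scoring loop. This is an illustration under stated assumptions, not the actual harness: `Question`, `score`, `stub_judge`, and the two toy systems are hypothetical names, and the stub judge does a string comparison where a real run would call the LLM judge.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Question:
    category: str  # e.g. "temporal", "multi-hop"
    text: str
    gold: str      # reference answer


# Hypothetical types: a system maps a question to an answer,
# a judge decides whether that answer matches the reference.
System = Callable[[str], str]
Judge = Callable[[str, str, str], bool]


def score(system: System, judge: Judge, questions: List[Question]) -> Dict[str, float]:
    """Return accuracy per category plus an "overall" entry."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for q in questions:
        verdict = judge(q.text, q.gold, system(q.text))
        for key in (q.category, "overall"):
            total[key] += 1
            correct[key] += int(verdict)
    return {key: correct[key] / total[key] for key in total}


def stub_judge(question: str, gold: str, answer: str) -> bool:
    # Stand-in for the LLM judge: case-insensitive exact match.
    return gold.strip().lower() == answer.strip().lower()


# Toy questions and systems, invented for this sketch only.
QUESTIONS = [
    Question("temporal", "Where did Ana live in 2021?", "Lisbon"),
    Question("temporal", "Where does Ana live now?", "Berlin"),
    Question("multi-hop", "Which city is Ana's employer based in?", "Berlin"),
]


def system_a(question: str) -> str:
    return "Lisbon" if "2021" in question else "Berlin"


def system_b(question: str) -> str:
    return "Berlin"


# Same questions, same judge, same scoring code for both systems.
RESULTS = {
    name: score(fn, stub_judge, QUESTIONS)
    for name, fn in {"system_a": system_a, "system_b": system_b}.items()
}
```

The design point is that nothing in `score` knows which system it is grading, so any difference in the output table comes from the systems rather than from the scoring path.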

One clarification up front, because it matters for honesty: we did not compare against the numbers mem0 publishes for its Python stack. Those use different prompts and a different implementation, so they are not comparable. We measured mem0 OSS TypeScript ourselves, on the same harness, and that is what the accuracy numbers in this post reflect.

Results

Overall recall

Across the held-out set, FishMem answered 62.2% of questions correctly by LLM judge, versus 54.5% for mem0 OSS.

Temporal reasoning

The widest gap is on questions about when something was true: 69.8% vs 28.3%, a 41.5-point difference. This is where bi-temporal facts earn their keep: FishMem records both when a fact was true and when it was learned, so a superseded fact is marked as superseded rather than silently overwritten.
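To make the bi-temporal idea concrete, here is a minimal sketch, assuming a simple in-memory list. It is not FishMem's storage model: `Fact`, `BiTemporalStore`, and the field names are hypothetical. Each fact carries a validity interval plus the date it was learned, and a newer fact closes the older one's interval instead of replacing it.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    attribute: str
    value: str
    valid_from: date                 # when the fact became true
    learned_on: date                 # when the system recorded it
    valid_to: Optional[date] = None  # None means "still true"


class BiTemporalStore:
    def __init__(self) -> None:
        self.facts: List[Fact] = []

    def assert_fact(self, new: Fact) -> None:
        """Close any open fact for the same key instead of deleting it."""
        for i, old in enumerate(self.facts):
            same_key = (old.subject, old.attribute) == (new.subject, new.attribute)
            if same_key and old.valid_to is None and old.valid_from < new.valid_from:
                self.facts[i] = replace(old, valid_to=new.valid_from)
        self.facts.append(new)

    def value_at(self, subject: str, attribute: str, when: date) -> Optional[str]:
        """Return the value that was true on `when`, if any."""
        for f in self.facts:
            if (f.subject, f.attribute) != (subject, attribute):
                continue
            if f.valid_from <= when and (f.valid_to is None or when < f.valid_to):
                return f.value
        return None


# Toy data, invented for this sketch.
store = BiTemporalStore()
store.assert_fact(Fact("ana", "city", "Lisbon", date(2020, 1, 1), date(2020, 1, 5)))
store.assert_fact(Fact("ana", "city", "Berlin", date(2023, 6, 1), date(2023, 6, 3)))

then = store.value_at("ana", "city", date(2021, 3, 1))  # the older fact still answers
now = store.value_at("ana", "city", date(2024, 1, 1))   # the newer fact answers
```

Both facts remain in the store after the second write, which is what lets a question about 2021 and a question about today get different answers. The sketch records `learned_on` but does not query on it; a fuller version would also filter by what was known at a given date.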

Multi-hop

On multi-hop questions that chain several facts together, mem0 OSS leads, 57.1% to 54.8%. We are not going to hide a result we lose. It is a real gap and an area we are actively working on; if multi-hop chaining is the core of your workload, you should know that before you choose.

Search latency

Measured head-to-head from the same client, FishMem search lands at a p50 of ~420ms, versus 2.4 to 3.8 seconds for a leading hosted alternative on the same queries — about 6 to 9 times faster per turn.
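For readers who want to check the ratio or take the same measurement on their own client, a minimal sketch follows. The `search` callable is a placeholder for whatever client call you are timing; none of these function names come from the FishMem or mem0 APIs.

```python
import statistics
import time
from typing import Callable, List


def measure_latencies_ms(search: Callable[[str], object], queries: List[str]) -> List[float]:
    """Time each call with a monotonic clock and return milliseconds."""
    samples = []
    for q in queries:
        start = time.perf_counter()
        search(q)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def p50(samples_ms: List[float]) -> float:
    """Median latency: half of the calls were at least this fast."""
    return statistics.median(samples_ms)


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    return baseline_ms / candidate_ms


# Ratios implied by the figures quoted above (420 ms vs 2.4 s to 3.8 s).
LOW = speedup(2400.0, 420.0)
HIGH = speedup(3800.0, 420.0)
```

With the quoted figures, `LOW` is about 5.7 and `HIGH` is about 9.0, which is the "6 to 9 times" range after rounding. When you take your own measurement, run both systems from the same machine and network path, since client location alone can shift a p50 by hundreds of milliseconds.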

Why the numbers move

FishMem does not retrieve by maximizing context. It extracts durable facts on write, stores them in a hybrid graph and vector layer, and retrieves with four signals at once: vector similarity, keyword match, temporal ordering, and graph diffusion across connected entities. The temporal signal is what carries the time-based questions; graph diffusion is what surfaces a relevant fact whose words do not match the query. The cost side follows from the same design: returning a few right facts instead of a large slice of history is what keeps latency and tokens down.
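One way to picture retrieval with four signals is a weighted sum over per-fact scores. The post does not say how FishMem combines its signals, so the sketch below is an assumption throughout: the `Signals` type, the weights, and the candidate values are all hypothetical and chosen only to illustrate the behavior described above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Signals:
    vector: float    # embedding similarity, 0..1
    keyword: float   # lexical match, 0..1
    temporal: float  # fit to the time the query asks about, 0..1
    graph: float     # activation spread from entities in the query, 0..1


# Hypothetical weights, chosen only for illustration.
WEIGHTS = Signals(vector=0.4, keyword=0.2, temporal=0.2, graph=0.2)


def fused_score(s: Signals, w: Signals = WEIGHTS) -> float:
    return (w.vector * s.vector + w.keyword * s.keyword
            + w.temporal * s.temporal + w.graph * s.graph)


def top_k(candidates: Dict[str, Signals], k: int) -> List[str]:
    """Return the k fact ids with the highest fused score."""
    ranked = sorted(candidates, key=lambda fid: fused_score(candidates[fid]), reverse=True)
    return ranked[:k]


CANDIDATES = {
    # Same wording as the query and still true.
    "current_match": Signals(vector=0.8, keyword=0.8, temporal=0.9, graph=0.2),
    # Same wording as the query, but superseded.
    "stale_match": Signals(vector=0.9, keyword=0.9, temporal=0.1, graph=0.2),
    # No shared words, but current and connected to a query entity.
    "linked_fact": Signals(vector=0.5, keyword=0.0, temporal=0.9, graph=0.9),
    # Shares words with the query, unconnected and poorly timed.
    "lexical_noise": Signals(vector=0.6, keyword=0.5, temporal=0.3, graph=0.0),
}

BEST = top_k(CANDIDATES, 3)
```

In this toy data, ranking on vector similarity alone would put `stale_match` first and `lexical_noise` ahead of `linked_fact`. With all four signals, the current fact moves to the top and the linked fact enters the top three despite sharing no keywords with the query.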

Where this is honest about its limits

mem0 wins multi-hop. Stated above, repeated here so it is not buried.
One judge, one dataset. An LLM judge is a proxy for human judgment, and 233 questions is a held-out set, not the whole world. A different dataset will move the numbers.
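Not mem0 Python or mem0 hosted. The accuracy comparison is against mem0 OSS TypeScript on our harness. The latency figure uses a separate baseline, the hosted alternative noted under Search latency.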
Your data is not our data. The only number that ultimately matters is the one you get on your own workload, which is exactly why we made that easy to produce.
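To give the "233 questions" caveat a rough sense of scale, the sketch below computes a normal-approximation interval around each overall score. It is a back-of-envelope check, not part of the published harness, and it treats each score as an independent proportion. Both systems answered the same questions, so a paired test on per-question results would be the more precise tool.

```python
import math
from typing import Tuple


def normal_interval(p: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation (Wald) interval for a proportion."""
    half = z * math.sqrt(p * (1.0 - p) / n)
    return (p - half, p + half)


N = 233  # size of the held-out set reported in this post
FISHMEM = normal_interval(0.622, N)  # overall accuracy, FishMem
MEM0 = normal_interval(0.545, N)     # overall accuracy, mem0 OSS

FISHMEM_HALF_WIDTH = (FISHMEM[1] - FISHMEM[0]) / 2
MEM0_HALF_WIDTH = (MEM0[1] - MEM0[0]) / 2
```

Under this approximation each overall score carries roughly 6 percentage points of sampling noise on either side at the 95% level. Per-category scores rest on fewer questions and are noisier still, which is one more reason to rerun the benchmark on your own data.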

Rerun it yourself

The harness, the prompts, and the judge are open. The point of an honest benchmark is that you do not have to trust ours: you can clone the engine, point it at your own questions, and produce your own table, including the cases where we lose. Start with the open-source engine and the docs.

Try FishMem

If you want the same engine without operating it, grab a hosted API key — the Hobby plan is free — and point your existing mem0 client at it; the API is mem0-compatible. New here? Start with Introducing fishmem. We would rather show you a benchmark we lose than one you cannot reproduce.