r/LocalLLaMA · 2 min read

I adapted the new δ-mem research for Apple Silicon using MLX, with OpenClaw integration! My findings


So I’ve been nerding out hard on memory, and I’ve come to the conclusion that context management is too high-level; dynamically updating the weights would be better. Luckily, this morning I checked my news feed and saw this new paper: https://arxiv.org/abs/2605.12357

It improves the model’s attention direction without using context or a LoRA, with roughly 20% better answers in their tests! It doesn’t rely on direct memory queries or context, just weighted attention direction.
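I haven’t seen the paper’s code, so here’s just a minimal sketch of how I read “weighted attention direction”: a small learned bias added to the attention logits at inference time, with no extra context tokens and no LoRA on the weight matrices. Everything here (the `attention_with_delta` name, `delta_bias`, the single-head shapes) is my own illustration, not the paper’s API.

```python
import mlx.core as mx

def attention_with_delta(q, k, v, delta_bias=None):
    # q, k, v: (seq_len, head_dim); delta_bias: (seq_len, seq_len) or None
    scores = (q @ mx.transpose(k)) * (q.shape[-1] ** -0.5)
    if delta_bias is not None:
        # steer attention toward "remembered" positions without adding tokens
        scores = scores + delta_bias
    weights = mx.softmax(scores, axis=-1)
    return weights @ v

# toy check: a zero bias reproduces plain scaled dot-product attention
L, d = 8, 16
q, k, v = (mx.random.normal((L, d)) for _ in range(3))
plain = attention_with_delta(q, k, v)
steered = attention_with_delta(q, k, v, delta_bias=mx.zeros((L, L)))
print(mx.allclose(plain, steered))  # True
```

How `delta_bias` actually gets produced is the part the adapter handles; the sketch above just shows where it plugs into attention while the base weights stay frozen.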

I wanted to try it out on my Mac Mini (Apple Silicon, 64 GB) to see if it could improve my agents’ responses. Local agents are already usable, but even a slight improvement would be huge!

I implemented it using MLX (way faster than Ollama, btw) and tested it with and without my OpenClaw session history.
https://github.com/elimaine/delta-mem-mlx-sidecar-w-openclaw

Here’s the adapter I made so it works with MLX: https://huggingface.co/ofthetrees/delta-mem-qwen3-4b-instruct-mlx-adapter
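If you want to reproduce the two conditions, this is roughly what my harness does. Treat it as a sketch: I’m assuming the adapter loads through mlx_lm’s generic `adapter_path` hook (if not, use the loader in the sidecar repo above), `MODEL` should point at whatever MLX-converted Qwen3-4B-Instruct you have locally, and the probe text plus `openclaw_session.txt` are placeholder stand-ins for a real replay dump.

```python
from mlx_lm import load, generate

MODEL = "Qwen3-4B-Instruct"  # placeholder: point at your local MLX-converted checkpoint
ADAPTER = "ofthetrees/delta-mem-qwen3-4b-instruct-mlx-adapter"

# assumption: the δ-mem adapter works with mlx_lm's standard adapter loading;
# otherwise swap in the loader from the sidecar repo
model, tokenizer = load(MODEL, adapter_path=ADAPTER)

probe = "What did I ask you to refactor in yesterday's session?"  # example probe
history = open("openclaw_session.txt").read()  # hypothetical replay dump

# condition 1: no context at all, rely on δ-mem's attention steering alone
no_ctx = generate(model, tokenizer, prompt=probe, max_tokens=128)

# condition 2: prepend the OpenClaw session history as plain context
with_ctx = generate(model, tokenizer, prompt=history + "\n\n" + probe, max_tokens=128)

print(no_ctx)
print(with_ctx)
```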

δ-mem paper results (Qwen3-4B-Instruct) showed solid gains:

- Avg vs frozen backbone: `1.10x`
- MemoryAgentBench: `1.31x`
- LoCoMo: `1.20x`

Local normalized MLX tests were more mixed (I’m still fixing this chart; the no-context numbers are misleading):
| Result | Plain | δ-mem | Lift |
|---|---:|---:|---:|
| LoCoMo state-only | 0.0500 (misleading, warmup) | 0.1833 | 3.67x |
| LoCoMo session-context | 0.4667 | 0.5000 | 1.07x |
| OpenClaw replay | 0.5701 | 0.6667 | 1.17x |

- Synthetic probes were flat.
- LoCoMo-mini showed surprisingly strong relative gains.
- OpenClaw-style replay showed a smaller but more practically meaningful improvement (`6/8 → 7/8` probes passed).
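For anyone skimming the table: the Lift column is just the δ-mem score divided by the plain score, rounded to two decimals.

```python
# sanity check on the Lift column: lift = δ-mem score / plain score
rows = {
    "LoCoMo state-only":      (0.0500, 0.1833),
    "LoCoMo session-context": (0.4667, 0.5000),
    "OpenClaw replay":        (0.5701, 0.6667),
}
for name, (plain, delta) in rows.items():
    print(f"{name:24s} {delta / plain:.2f}x")
# LoCoMo state-only        3.67x
# LoCoMo session-context   1.07x
# OpenClaw replay          1.17x
```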

Overall, the paper’s benchmarks look real, and my local tests suggest δ-mem is doing something useful in realistic replay/memory scenarios.

Finally, the lower results are expected, as Apple Silicon can’t run CUDA. I really want to try it on the latest and greatest local model for me, qwen3.6:27b on MLX, which needs an adapter model trained. My current estimate is that would cost around $6k to run in the cloud, and as I’m unemployed (hire me) I can’t afford that right now. If someone with a huge machine wants to pick up where I left off, it’s nearly all there; the adapter generation just needs tweaking for the new Qwen’s attention structure. The original tests were already on Qwen, which helps a lot.

Thanks for reading! I’m proud of the project, which is my first groundbreaking work in the field of open source AI!

submitted by /u/Charming_You_25
