I ported the new δ-mem research to Apple silicon using MLX, with OpenClaw integration! My findings
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
So I’ve been nerding out hard about memory and have come to the conclusion that context management is too high-level, and that dynamically changing the weights would be best. Luckily, this morning I checked my news feed and saw this new paper! https://arxiv.org/abs/2605.12357
It steers the model’s attention without spending context or training a LoRA, and their tests report roughly 20% better answers! Instead of direct memory queries or extra context, it uses weighted attention direction.
I wanted to try it out on my Mac mini (64 GB, Apple silicon) to see if it could improve my agents’ responses. Local agents are already usable, but even a slight improvement would be huge!
I implemented it using MLX (way faster than Ollama, btw) and tested it with and without my OpenClaw session history.
https://github.com/elimaine/delta-mem-mlx-sidecar-w-openclaw
Here’s the adapter I made so it works with MLX: https://huggingface.co/ofthetrees/delta-mem-qwen3-4b-instruct-mlx-adapter
δ-mem paper results (Qwen3-4B-Instruct) showed solid gains:
- Avg vs frozen backbone: `1.10x`
- MemoryAgentBench: `1.31x`
- LoCoMo: `1.20x`
Local normalized MLX tests were more mixed:
(I am fixing this chart; the no-context numbers are misleading.)
| Result | Plain | δ-mem | Lift |
|---|---:|---:|---:|
| LoCoMo state-only | 0.0500 (misleading, warmup) | 0.1833 | 3.67x |
| LoCoMo session-context | 0.4667 | 0.5000 | 1.07x |
| OpenClaw replay | 0.5701 | 0.6667 | 1.17x |
- Synthetic probes were flat.
- LoCoMo-mini showed surprisingly strong relative gains.
- OpenClaw-style replay showed a smaller but more practically meaningful improvement (`6/8 → 7/8` probes passed).
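For anyone checking the table, the Lift column is just the δ-mem score divided by the plain score. Here is a minimal Python sketch that recomputes it from the numbers above. The names `RESULTS`, `lift`, `absolute_gain`, and `lift_table` are made up for illustration and are not from the repo. It also prints the absolute gain, which shows how a very small plain baseline (the state-only row) inflates the ratio.

```python
# Scores copied from the results table above: (plain, delta_mem).
# All names here are illustrative, not part of the linked repo.
RESULTS = {
    "LoCoMo state-only": (0.0500, 0.1833),
    "LoCoMo session-context": (0.4667, 0.5000),
    "OpenClaw replay": (0.5701, 0.6667),
}


def lift(plain: float, delta_mem: float) -> float:
    """Relative lift: delta-mem score as a multiple of the plain score."""
    if plain <= 0:
        raise ValueError("plain score must be positive to compute a ratio")
    return delta_mem / plain


def absolute_gain(plain: float, delta_mem: float) -> float:
    """Absolute difference between the delta-mem and plain scores."""
    return delta_mem - plain


def lift_table(results: dict) -> dict:
    """Map each result name to its lift, rounded to two decimals."""
    return {name: round(lift(p, d), 2) for name, (p, d) in results.items()}


if __name__ == "__main__":
    for name, (plain, delta_mem) in RESULTS.items():
        print(
            f"{name}: lift {lift(plain, delta_mem):.2f}x, "
            f"absolute gain {absolute_gain(plain, delta_mem):+.4f}"
        )
```

The state-only row has by far the largest ratio, but its absolute gain (about 0.13) is in the same ballpark as the OpenClaw replay row (about 0.10), so ratios against a near-zero baseline should be read with care.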
Overall the paper benchmarks look real, and local tests suggest δ-mem is doing something useful in realistic replay/memory scenarios.
Finally, the lower results are expected, since Apple silicon can’t run CUDA and everything here goes through MLX instead. I really want to try it on the latest and greatest local model for me, qwen3.6:27b on MLX, but that needs an adapter model trained first. My current estimate is that it would cost about 6k to run in the cloud, and as I am unemployed (hire me) I can’t afford that right now. If someone with a huge computer wants to pick up where I left off, it’s nearly all there; the adapter generation just needs tweaking for the new Qwen’s attention structure. The original test was already on Qwen, so that helps a lot.
Thanks for reading! I’m proud of the project, which is my first groundbreaking work in the field of open source AI!