Lemonade v10.8: auto memory management, cloud offload, Omni improvements, and call your local models as MCP tools
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
| v10.8 is out, so here's a project update on what landed. This was a 20-contributor release in just 7 days! Smarter memory and context management Dynamic VRAM management now auto-unloads idle models and downsizes their KV-cache to reclaim GPU memory on the fly, plus model pinning so the ones you want hot never get evicted. Automatic context sizing means Lemonade picks the context length from your available memory and the model architecture instead of you tuning it by hand. Cloud offload, sitting next to your local models Sometimes you want a bigger model than your box can run. There's now a provider-agnostic offload backend so you can serve chat completions from any OpenAI-compatible provider (Fireworks, OpenRouter, Together, OpenAI) right alongside local models, and switch from the CLI or UI. Local-first, with cloud as an option, not a default. Eventually we want to enable applications to route between client and cloud based on their own routing policies. LMX-Omni image generation expansion LMX-Omni now exposes controls like size, steps, etc. for image generation. You can also pull and share custom omni models straight from Hugging Face. An MCP gateway, so your local models become tools There's now an MCP gateway ( Lots of platform expansion The cross-vendor push continued across AMD, NVIDIA, and more: NVIDIA GB10 (Blackwell) arm64 CUDA, TheRock ROCm on Windows for Radeon RX GPUs, ROCm for the Radeon 840M/860M iGPUs, whisper.cpp moved to ROCm on Windows and Linux, a dedicated Debian 13 build, and a CDNA datacenter GPU detection fix. Also we just got this sick new chat CLI! Full release notes are on GitHub: https://github.com/lemonade-sdk/lemonade/releases/tag/v10.8.0 [link] [comments] |
More from r/LocalLLaMA
-
Been running Qwen3.6-27B through a 3-critic harness. The harness matters more than I thought
Jun 30
-
I Hate Dario Amodei, and everything he stands for.
Jun 29
-
Introducing LongCat-2.0 - , a large-scale MoE language model with 1.6 trillion total parameters and ~48 billion activated per token. This was the stealth model that was on Openrouter under the name 'owl-alpha'.
Jun 29
-
Krea-2-Turbo Image Model - Easy to be fully uncensored, but it can also EDIT Images!
Jun 29
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.