r/LocalLLaMA · · 1 min read

I put together a Rust-native, CPU-only implementation of LFM2.5-8B-A1B


This is still a work in progress, but since recording the video I have added callbacks for tool use, written more tests, and published it as a cargo crate. I'm currently working on speeding up the prefill.

The decode speed is almost the same on my Ryzen 7950X (~37 tokens/s). The prefill is not yet optimized, so it currently runs at almost the same speed as decode.

This model runs comfortably on a machine with 16GB of RAM; its memory usage fits within ~7GB. You can share the weights between multiple Agent instances, each with its own KV cache. If your agents use the same prompt, you can also clone an Agent instance after the prompt has been processed, so the prefill work on that prompt is not repeated.
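To illustrate the sharing pattern described above, here is a minimal sketch. It is not the crate's actual API: `Weights`, `Agent`, `prefill`, and the `kv_cache` field are hypothetical names, and the "KV cache" is just a list of token ids standing in for real per-layer state. The point is only that the weights can sit behind an `Arc` and be shared, while each agent (including clones) owns its own cache.

```rust
use std::sync::Arc;

/// Stand-in for the read-only model weights (hypothetical, not the crate's type).
struct Weights {
    data: Vec<f32>,
}

/// Hypothetical agent: shares the weights through an `Arc`, owns its KV cache.
#[derive(Clone)]
struct Agent {
    weights: Arc<Weights>, // cloning an agent copies only this pointer
    kv_cache: Vec<u32>,    // per-agent state; here simply the token ids seen so far
}

impl Agent {
    fn new(weights: Arc<Weights>) -> Self {
        Agent { weights, kv_cache: Vec::new() }
    }

    /// Pretend prefill: record the prompt tokens in this agent's own cache.
    fn prefill(&mut self, prompt: &[u32]) {
        self.kv_cache.extend_from_slice(prompt);
    }
}

fn main() {
    // Load the weights once and share them.
    let weights = Arc::new(Weights { data: vec![0.0; 1024] });

    // Do the prompt prefill once on a base agent.
    let mut base = Agent::new(Arc::clone(&weights));
    base.prefill(&[1, 2, 3]);

    // Clones start from the prefilled cache and then diverge independently.
    let mut a = base.clone();
    let mut b = base.clone();
    a.prefill(&[10]);
    b.prefill(&[20]);

    println!("weights: {} floats, {} references", a.weights.data.len(), Arc::strong_count(&weights));
    println!("a cache: {:?}", a.kv_cache);
    println!("b cache: {:?}", b.kv_cache);
}
```

In this sketch, cloning an agent duplicates its cache but never the weight buffer, which is why several agents can fit in roughly the memory of one model plus their individual caches.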

submitted by /u/maximecb
