r/LocalLLaMA · May 28, 2026 · 1 min read

Granite 4.1 Architecture Changes?

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Hey all. Does anyone know why IBM decided to return to a pure transformer architecture for Granite 4.1? Their release post mentions that it's easier to fine-tune than Granite 4, but surely the drawbacks outweigh this benefit, especially for a model that is often used for well-defined basic tasks like document summarization and translation, which don't particularly require fine-tuning. Perhaps it's a consideration for tool calling?

Granite 4 used a hybrid Mamba/attention architecture. It came in a variety of dense and MoE sizes that covered a lot of use cases and setups. I'm relatively GPU-poor, and it's the first model that let me ingest entire 100+ page documents while staying at a usable speed even with its context almost full. On my modest hardware (8GB VRAM, Intel Alchemist dGPU) I can use the full 128k context without even quantizing the cache; it ingests at ~1000 tokens per second and generates at ~40 tokens per second. For basic document-related or highly structured tasks, that's practically unbeatable from what I've seen.

By contrast, the "improved" Granite 4.1 only fits ~14k context (with a q8-quantized cache) on my hardware, and it ingests and generates at less than half the speed (~300 tokens per second in, ~15 tokens per second out). Part of the gap is that I'm comparing the old 7B MoE to the new 8B dense model (4.1 does not offer an MoE variant for some reason), both at Q4_K_M. It's hard to even evaluate whether the output is truly "better" for my use cases, because it can't handle many of them at all.
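
For anyone wondering why the context limit drops so sharply, here is a rough back-of-the-envelope sketch of KV-cache size for a pure-attention model. The layer count, KV head count, and head dimension below are hypothetical placeholders, not Granite 4.1's actual config, and the 1 byte per element for q8 ignores the small per-block scale overhead. The point is only that the cache grows linearly with context, whereas a Mamba-style layer keeps a fixed-size state regardless of context length.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    """Approximate KV-cache size for a pure-attention transformer."""
    # Factor of 2: one tensor for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical dimensions for an ~8B dense model with grouped-query attention.
# These are illustrative assumptions, NOT the real Granite 4.1 values.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 40, 8, 128

scenarios = [
    ("128k context, f16 cache", 131072, 2),
    ("128k context, q8 cache (approx.)", 131072, 1),
    ("14k context, q8 cache (approx.)", 14336, 1),
]

for label, n_ctx, bytes_per_elem in scenarios:
    size = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, n_ctx, bytes_per_elem)
    print(f"{label}: {size / GIB:.2f} GiB")
```

With these made-up dimensions, a full 128k f16 cache comes out to about 20 GiB, far beyond an 8GB card even before the weights are loaded, while ~14k at q8 is roughly 1 GiB. Swap in the real values from the model's config to get an estimate for your own setup.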

Does anyone have any insight into whether IBM intends to continue offering the Mamba hybrid architecture in future models? I've looked around online for this, but can't find much discussion about it.

submitted by /u/the-salami