Google's Gemma 4 AI models get 3x speed boost by predicting future tokens
Mirrored from Ars Technica — AI for archival readability. Support the source by reading on the original site.
Google launched its Gemma 4 open models this spring, promising a new level of power and performance for local AI. Now, Google's take on edge AI could be getting even faster with the release of Multi-Token Prediction (MTP) drafters for Gemma. Google says these experimental models use a form of speculative decoding: a small drafter guesses several future tokens ahead, and the main model verifies those guesses, which can speed up generation compared with producing one token at a time.
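The mechanics of speculative decoding can be sketched in a few lines. The toy below stands in for both models with simple deterministic functions (nothing here is Gemma's actual API): a cheap drafter proposes a run of tokens, the "target" model checks them, and the longest correct prefix is accepted in one go, with the target's own token replacing the first miss.

```python
# Toy sketch of speculative decoding. The token rules here are illustrative
# stand-ins, not real models; the loop structure is the point.

def target_next(seq):
    """Stand-in for the full model's greedy next-token choice."""
    return (seq[-1] * 31 + 7) % 100

def draft_next(seq):
    """Stand-in for the fast drafter: usually agrees, sometimes misses."""
    guess = target_next(seq)
    return guess if guess % 10 != 3 else (guess + 1) % 100

def speculative_decode(prompt, n_tokens, k=4):
    seq = list(prompt)
    target_calls = 0
    while len(seq) - len(prompt) < n_tokens:
        # 1. Drafter proposes k tokens in a row, each conditioned on its own guesses.
        draft, ctx = [], seq[:]
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Target verifies the proposals. Here it checks them one by one;
        #    in practice all k checks happen in a single batched forward pass,
        #    which is where the speedup comes from.
        ctx = seq[:]
        for t in draft:
            correct = target_next(ctx)
            target_calls += 1
            if t == correct:
                ctx.append(t)
            else:
                ctx.append(correct)  # target's token replaces the first miss
                break
        seq = ctx
    return seq[len(prompt):][:n_tokens], target_calls
```

Because every accepted token is verified against the target model's own choice, the output is identical to plain greedy decoding; the drafter only changes how many tokens can be committed per verification round.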
The latest Gemma models are built on the same underlying technology that powers Google's frontier Gemini AI, but they're tuned to run locally. Gemini is optimized for Google's custom TPU chips, which operate in enormous clusters with super-fast interconnects and memory. By contrast, a single high-power AI accelerator can run the largest Gemma 4 model at full precision, and a quantized version can run on a consumer GPU.
Gemma lets users tinker with AI on their own hardware rather than sharing their data with a cloud AI system from Google or anyone else. Google also changed the license for Gemma 4 to Apache 2.0, which is much more permissive than the custom Gemma license used for previous releases. However, the hardware most people have imposes inherent limits on local AI models. That's where MTP comes in.