Adding E4B audio encoder to larger models
Mirrored from r/LocalLLaMA for archival readability.
I am curious if anyone here has tried doing this. I did a bit of digging and it seems like it would be easier to do than I first thought, and I would like to ask for correction if my assumptions are wrong. Here is how I would go about it:
- Extract the ~300 MB audio encoder from E4B or E2B
- Create a new linear projection layer in PyTorch that maps the E4B encoder output to the hidden dimension size of the larger target model
- Get a dataset of text and audio pairs
- Freeze both the large model and audio encoder and only train the new linear projection layer
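The projection step above can be sketched in a few lines of PyTorch. The dimensions here are placeholders (I don't know the exact encoder output width, and the LLM hidden size depends on the target model), so treat them as assumptions:

```python
import torch
import torch.nn as nn

# Assumed dimensions for illustration only:
AUDIO_DIM = 1536  # hypothetical width of the E4B audio encoder output
LLM_DIM = 4096    # hypothetical hidden size of the larger target model

class AudioProjector(nn.Module):
    """Linear bridge from audio-encoder frames into the LLM's embedding space."""
    def __init__(self, audio_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim) -> (batch, frames, llm_dim)
        return self.proj(audio_feats)

projector = AudioProjector(AUDIO_DIM, LLM_DIM)
dummy_frames = torch.randn(2, 100, AUDIO_DIM)  # fake encoder output
projected = projector(dummy_frames)
print(projected.shape)  # torch.Size([2, 100, 4096])
```

The projected frames would then be concatenated with the text token embeddings before being fed into the frozen LLM, same as the LLaVA-style adapter recipes.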
Since only the new layer has to be trained, it should be relatively quick to train and wouldn't negatively affect the larger model's output. Basically the same as this paper, but instead of using the Whisper encoder, using the Gemma one, which has been built for low-latency LLMs.
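A minimal sketch of the freeze-and-train loop, using small stand-in modules for the real encoder and LLM (only the gradient-flow pattern matters here; a real run would use the actual models and a cross-entropy loss over text tokens, not the MSE placeholder below):

```python
import torch
import torch.nn as nn

# Stand-ins for the real components; shapes are illustrative assumptions.
audio_encoder = nn.Linear(80, 1536)   # placeholder for the E4B audio encoder
llm = nn.Linear(4096, 4096)           # placeholder for the large target model
projector = nn.Linear(1536, 4096)     # the only trainable piece

# Freeze the encoder and the LLM; train the projector alone.
for module in (audio_encoder, llm):
    for p in module.parameters():
        p.requires_grad = False

optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

mels = torch.randn(4, 50, 80)      # fake log-mel input frames
target = torch.randn(4, 50, 4096)  # fake supervision signal

with torch.no_grad():
    feats = audio_encoder(mels)    # frozen forward pass

pred = llm(projector(feats))
loss = nn.functional.mse_loss(pred, target)  # placeholder loss
loss.backward()
optimizer.step()

# Only the projector accumulates gradients.
print(all(p.grad is None for p in audio_encoder.parameters()))  # True
print(all(p.grad is not None for p in projector.parameters()))  # True
```

Because the optimizer only sees the projector's parameters, each step updates a few million weights at most, which is what makes this cheap compared to any fine-tuning of the LLM itself.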