Adding E4B audio encoder to larger models
Mirrored from r/LocalLLaMA for archival readability.
I am curious if anyone here has tried doing this. I did a bit of digging and it seems like it would be easier to do than I first thought, and I would like to ask for correction if my assumptions are wrong. Here is how I would go about it:
- Extract the ~300 MB audio encoder from E4B or E2B
- Create a new linear projection layer in PyTorch that maps the E4B encoder output to the hidden dimension size of the larger target model
- Get a dataset of text and audio pairs
- Freeze both the large model and audio encoder and only train the new linear projection layer
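The projection step above can be sketched in a few lines of PyTorch. The dimensions here are placeholders (I don't know the exact encoder output width, and the LLM hidden size depends on the target model), so treat them as assumptions:

```python
import torch
import torch.nn as nn

# Assumed dimensions for illustration only:
AUDIO_DIM = 1536  # hypothetical width of the E4B audio encoder output
LLM_DIM = 4096    # hypothetical hidden size of the larger target model

class AudioProjector(nn.Module):
    """Linear bridge from audio-encoder frames into the LLM's embedding space."""
    def __init__(self, audio_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim) -> (batch, frames, llm_dim)
        return self.proj(audio_feats)

projector = AudioProjector(AUDIO_DIM, LLM_DIM)
dummy_frames = torch.randn(2, 100, AUDIO_DIM)  # fake encoder output
projected = projector(dummy_frames)
print(projected.shape)  # torch.Size([2, 100, 4096])
```

The projected frames would then be concatenated with the text token embeddings before being fed into the frozen LLM, same as the LLaVA-style adapter recipes.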
Since only the new layer has to be trained, it should be relatively quick to train and wouldn't negatively affect the larger model's output. Basically the same as this paper, but instead of using the Whisper encoder, using the Gemma one, which has been built for low-latency LLMs.
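A minimal sketch of the freeze-and-train loop, using small stand-in modules for the real encoder and LLM (only the gradient-flow pattern matters here; a real run would use the actual models and a cross-entropy loss over text tokens, not the MSE placeholder below):

```python
import torch
import torch.nn as nn

# Stand-ins for the real components; shapes are illustrative assumptions.
audio_encoder = nn.Linear(80, 1536)   # placeholder for the E4B audio encoder
llm = nn.Linear(4096, 4096)           # placeholder for the large target model
projector = nn.Linear(1536, 4096)     # the only trainable piece

# Freeze the encoder and the LLM; train the projector alone.
for module in (audio_encoder, llm):
    for p in module.parameters():
        p.requires_grad = False

optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

mels = torch.randn(4, 50, 80)      # fake log-mel input frames
target = torch.randn(4, 50, 4096)  # fake supervision signal

with torch.no_grad():
    feats = audio_encoder(mels)    # frozen forward pass

pred = llm(projector(feats))
loss = nn.functional.mse_loss(pred, target)  # placeholder loss
loss.backward()
optimizer.step()

# Only the projector accumulates gradients.
print(all(p.grad is None for p in audio_encoder.parameters()))  # True
print(all(p.grad is not None for p in projector.parameters()))  # True
```

Because the optimizer only sees the projector's parameters, each step updates a few million weights at most, which is what makes this cheap compared to any fine-tuning of the LLM itself.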