Tiny Scale Is All I Can Spare To Play With Transformer

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Hi! I am a student from India, and this is the first paper I have published.

I was curious whether I could combine Attention and FFN into a single operation to save parameters without sacrificing performance, specifically for models with <= 10M parameters.
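For a sense of where the parameters sit in a standard block, here is a rough count for a GPT-2 style layer: four d x d attention matrices plus a two-layer MLP with a 4x-wide hidden layer. Biases, LayerNorm and embeddings are ignored, and `d_model = 128` is an illustrative value, not a setting taken from the paper.

```python
def block_weights(d_model, ffn_mult=4):
    """Weight-matrix parameters in one GPT-2 style block (biases and norms ignored)."""
    attention = 4 * d_model * d_model        # W_q, W_k, W_v and the output projection
    ffn = 2 * ffn_mult * d_model * d_model   # up projection and down projection
    return {"attention": attention, "ffn": ffn, "total": attention + ffn}

counts = block_weights(128)  # 128 is a hypothetical width chosen for illustration
print(counts)  # {'attention': 65536, 'ffn': 131072, 'total': 196608}
print(round(counts["ffn"] / counts["total"], 3))  # 0.667
```

Under these assumptions the FFN holds about two thirds of a block's weights, which is why folding it into attention looks like a natural place to try to save parameters.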

Basically, my intuition was that Attention is dynamic and smart about which information to mix, but it has no strong non-linearity to actually transform that information. SwiGLU has the strong non-linearity, but it is static: the same weights are applied to every input. So instead of running both separately and wasting parameters, why not replace the static linear matrices in the FFN with attention, getting dynamic mixing and strong non-linearity in one unified operation?
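To make the intuition concrete, here is a toy sketch of one possible reading of it. This is not the Silia layer from the paper, whose exact formulation is not given in this post. It assumes a single head, one causal attention pattern shared by both branches, and SiLU gating as in SwiGLU; every name (`attention_glu`, `causal_mix`, `w_gate`, `w_up`, `w_out`) is hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def silu(v):
    return v / (1.0 + math.exp(-v))

def causal_mix(x, w_q, w_k):
    """Attention weights per token; token t only attends to tokens <= t."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    scale = 1.0 / math.sqrt(len(q[0]))
    weights = []
    for t in range(len(x)):
        scores = [scale * sum(a * b for a, b in zip(q[t], k[s])) for s in range(t + 1)]
        weights.append(softmax(scores) + [0.0] * (len(x) - t - 1))
    return weights

def attention_glu(x, p):
    """SwiGLU-style gating where both branches are mixed across tokens by attention."""
    mix = causal_mix(x, p["w_q"], p["w_k"])
    gate = matmul(mix, matmul(x, p["w_gate"]))  # input-dependent mixing, gate branch
    up = matmul(mix, matmul(x, p["w_up"]))      # input-dependent mixing, value branch
    hidden = [[silu(g) * u for g, u in zip(g_row, u_row)] for g_row, u_row in zip(gate, up)]
    return matmul(hidden, p["w_out"])

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, seq_len = 4, 3  # tiny illustrative sizes
names = ("w_q", "w_k", "w_gate", "w_up", "w_out")
params = {name: random_matrix(d_model, d_model, rng) for name in names}
x = random_matrix(seq_len, d_model, rng)
y = attention_glu(x, params)
print(len(y), len(y[0]))  # 3 4
```

In this reading, the gate and up branches of SwiGLU are no longer fixed per-token projections: each is a weighted mix over earlier tokens, so the non-linearity acts on information that attention has already selected.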

I'm not treating this paper as a final conclusion by any means, because I have very old hardware, and Google Colab doesn't help much with scaling up since I don't have a subscription. So I'm treating this paper as an introduction to my idea and to the experiments I was able to run at this scale.

Before adding the abstract, I'd also like you to know that just training the 0.8M-parameter model took 8-10 hours on my PC (just a few minutes on Google Colab), and the 4M model (which Google Colab wasn't letting me train) took around 3-4 days on my PC. That's the reason I didn't run many experiments in the paper.

Abstract

The introduction of the Transformer neural network architecture in the famous Attention Is All You Need paper has created a huge wave of AI development in recent years. Scaled dot-product attention allows information to be processed with higher efficiency and quality than the previous RNN-based models could manage. However, Transformer-based models come with their own challenges, particularly parameter efficiency for tiny models with ≤ 5M parameters. At such a small scale, a Transformer model essentially uses more parameters than it really should. This sub-ten-million-parameter domain is very underexplored, and for good reasons, but I wanted to explore it anyway. So in this paper I am introducing Silia, a novel transformer architecture designed for efficient modelling and classification tasks under a severe parameter budget. Trained against the GPT-2 architecture (Andrej Karpathy's nanoGPT project) with the same "base" hyperparameters, training data and compute budget, Silia achieves comparable loss and generation quality with significantly fewer parameters.

Thank you :)

submitted by /u/SrijSriv211