High Dimensional, Dynamic Rotary Positional Embedding [P]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
At the end of my last post, I presented an idea: what if I took the core of my last project, the cumulative matrix product, and repurposed it as a positional embedding? I just finished fleshing out the math behind HDD-RoPE and training a model with this positional embedding algorithm, and the results are excellent. When trained on the TinyStories dataset, the validation loss begins to converge a fair amount faster than that of a baseline transformer trained using xPos.

The repo at https://github.com/mikayahlevi/hdd-rope/ lets you replicate the results and goes in depth on the math and the details of the architecture.

Standard RoPE breaks the queries and keys into groups of two and rotates each pair at a predefined rate. This allows the model to learn relative position by observing the change in basis between the queries and keys. Pairs make intuitive sense for a linear sequence: a two-dimensional chunk can be rotated with a single degree of freedom, which corresponds to position progressing linearly along one dimension.

If you would like to learn more, please check out the repo, where I formalize the math and lay out a roadmap.
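For readers who have not seen the pair rotation of standard RoPE written out, here is a minimal NumPy sketch of it. It only illustrates the baseline idea described above, not HDD-RoPE itself (see the repo for that). The function name `rope_rotate`, the `base` of 10000, and the geometric rate schedule are assumptions chosen for illustration.

```python
import numpy as np

def rope_rotate(x, position, base=10000.0):
    """Rotate consecutive pairs of x by position-dependent angles.

    Hypothetical helper for illustration: pair i is rotated by
    position * theta_i, where theta_i follows the commonly used
    geometric schedule base ** (-2i / d).
    """
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even to form pairs"
    # One predefined rotation rate per pair.
    theta = base ** (-np.arange(0, d, 2) / d)
    angle = position * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    # Apply a 2x2 rotation to each (even, odd) pair.
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)

# Same relative offset (query is 3 positions after the key),
# but very different absolute positions.
score_a = rope_rotate(q, 5) @ rope_rotate(k, 2)
score_b = rope_rotate(q, 105) @ rope_rotate(k, 102)
```

Because each pair is rotated by an angle proportional to position, the two rotations partially cancel inside the dot product, so `score_a` and `score_b` agree up to floating-point error: the attention score depends only on the relative offset between query and key.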