r/MachineLearning · June 24, 2026 · 2 min read

High Dimensional, Dynamic Rotary Positional Embedding [P]

Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.

High Dimensional, Dynamic Rotary Positional Embedding [P]

At the end of my last post, I presented an idea: what if I used the core of my last project, the cumulative matrix product, and repurposed it as a positional embedding?

I just finished fleshing out the math behind HDD-RoPE and training a model with this positional embedding algorithm, and the results are excellent. When trained on the dataset TinyStories, the validation loss begins to converge a fair amount faster than the baseline transformer trained using xPos.

A GPT-2-like model trained on TinyStories with hyperparameters copied from https://huggingface.co/roneneldan/TinyStories-33M (n_blocks=4, d_model=d_k=d_v=768)

The repo at https://github.com/mikayahlevi/hdd-rope/ allows you to replicate the results and goes in depth about the math and details of the architecture.

Standard RoPE breaks the queries and keys into groups of two and rotates each pair at a predefined rate. This allows the model to learn relative position by observing the change in basis between the queries and keys. Pairs of two make intuitive sense for a linear sequence, as a chunk can be rotated with a single degree of freedom, corresponding to linear one-dimensionally progressing position.
HDD-RoPE moves past this intuition and instead says that position within a sequence is multidimensional. Therefore, the chunks can be broken into any size, such as 4 as used in the TinyStories example. Four-dimensional chunks correspond to 4 choose 2 = 6 axes of rotation (6-dimensional position.) Essentially, we're saying that a token doesn't just lie at a position within the sequence, but a position within any construct the model can learn, such as a paragraph or sentence.
To facilitate this, I also make the amount of rotation along each axis data-dependent, such that it can learn how to advance the positions based on information stored in the current layer's activations.

If you would like to learn more, please check out the repo. I formalize the math and lay out a roadmap.

submitted by /u/mikayahlevi
[link] [comments]

Discussion (0)

No comments yet. Sign in and be the first to say something.

Discussion (0)

More from r/MachineLearning