The Transformer Family Version 2.0
Mirrored from Lil'Log (Lilian Weng) for archival readability. Support the source by reading on the original site.
Many new Transformer architecture improvements have been proposed since my last post on “The Transformer Family” about three years ago. Here I did a big refactoring and enrichment of that 2020 post: I restructured the hierarchy of sections and improved many sections with more recent papers. Version 2.0 is a superset of the old version, about twice the length.
Notations
| Symbol | Meaning |
|---|---|
| $d$ | The model size / hidden state dimension / positional encoding size. |
| $h$ | The number of heads in a multi-head attention layer. |
| $L$ | The segment length of the input sequence. |
| $N$ | The total number of attention layers in the model; not considering MoE. |
| $\mathbf{X} \in \mathbb{R}^{L \times d}$ | The input sequence where each element has been mapped into an embedding vector of shape $d$, same as the model size. |
| $\mathbf{W}^k \in \mathbb{R}^{d \times d_k}$ | The key weight matrix. |
| $\mathbf{W}^q \in \mathbb{R}^{d \times d_k}$ | The query weight matrix. |
| $\mathbf{W}^v \in \mathbb{R}^{d \times d_v}$ | The value weight matrix. Often we have $d_k = d_v = d$. |
| $\mathbf{W}^k_i, \mathbf{W}^q_i \in \mathbb{R}^{d \times d_k/h}; \mathbf{W}^v_i \in \mathbb{R}^{d \times d_v/h}$ | The weight matrices per head. |
| $\mathbf{W}^o \in \mathbb{R}^{d_v \times d}$ | The output weight matrix. |
| $\mathbf{Q} = \mathbf{X}\mathbf{W}^q \in \mathbb{R}^{L \times d_k}$ | The query embedding inputs. |
| $\mathbf{K} = \mathbf{X}\mathbf{W}^k \in \mathbb{R}^{L \times d_k}$ | The key embedding inputs. |
| $\mathbf{V} = \mathbf{X}\mathbf{W}^v \in \mathbb{R}^{L \times d_v}$ | The value embedding inputs. |
| $\mathbf{q}_i, \mathbf{k}_i \in \mathbb{R}^{d_k}, \mathbf{v}_i \in \mathbb{R}^{d_v}$ | Row vectors in query, key, value matrices, $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$. |
| $S_i$ | A collection of key positions for the $i$-th query $\mathbf{q}_i$ to attend to. |
| $\mathbf{A} \in \mathbb{R}^{L \times L}$ | The self-attention matrix between an input sequence of length $L$ and itself. $\mathbf{A} = \text{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k})$. |
| $a_{ij} \in \mathbf{A}$ | The scalar attention score between query $\mathbf{q}_i$ and key $\mathbf{k}_j$. |
| $\mathbf{P} \in \mathbb{R}^{L \times d}$ | The positional encoding matrix, where the $i$-th row $\mathbf{p}_i$ is the positional encoding for input $\mathbf{x}_i$. |
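
To make the notation concrete, here is a minimal sketch of single-head self-attention that follows the symbols in the table: $\mathbf{Q} = \mathbf{X}\mathbf{W}^q$, $\mathbf{K} = \mathbf{X}\mathbf{W}^k$, $\mathbf{V} = \mathbf{X}\mathbf{W}^v$ and $\mathbf{A} = \text{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k})$. It uses plain Python lists so that it runs without dependencies. The helper names (`matmul`, `softmax`, `self_attention`) and the toy values are illustrative assumptions, not from the original post. The final product $\mathbf{A}\mathbf{V}$ is the usual attention output; the table itself does not spell it out.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over a single row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, Wq, Wk, Wv):
    """Return (A, A V) for one attention head, using the table's notation."""
    Q = matmul(X, Wq)                      # L x d_k
    K = matmul(X, Wk)                      # L x d_k
    V = matmul(X, Wv)                      # L x d_v
    d_k = len(Wq[0])
    K_T = [list(col) for col in zip(*K)]   # d_k x L
    scores = matmul(Q, K_T)                # L x L, unnormalized
    A = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return A, matmul(A, V)                 # A is L x L, output is L x d_v

# Toy input (hypothetical values): L = 3, d = d_k = d_v = 2,
# with identity weight matrices so that Q = K = V = X.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
A, out = self_attention(X, I2, I2, I2)
```

Each row of `A` sums to 1, and entry `A[i][j]` corresponds to the scalar score $a_{ij}$ between query $\mathbf{q}_i$ and key $\mathbf{k}_j$. A multi-head version would repeat this with the per-head matrices $\mathbf{W}^q_i, \mathbf{W}^k_i, \mathbf{W}^v_i$, concatenate the head outputs, and project them with $\mathbf{W}^o$.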
Transformer Basics
The Transformer model (referred to below as the “vanilla Transformer” to distinguish it from other enhanced versions; Vaswani, et al., 2017) has an encoder-decoder architecture, as commonly used in many NMT models. Later, simplified Transformers were shown to achieve great performance in language modeling tasks, as in the encoder-only BERT or the decoder-only GPT.