- Key and Value Compression: The input for the $t$-th token at an attention layer is denoted as $\mathbf{h}_t \in \mathbb{R}^d$, where $d$ is the embedding dimension.
  - A compressed latent vector for keys and values, $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$, is computed using a down-projection matrix $W^{DKV} \in \mathbb{R}^{d_c \times d}$: $\mathbf{c}_t^{KV} = W^{DKV} \mathbf{h}_t$
  - Here, $d_c$ ($\ll d_h n_h$) is the KV compression dimension, much smaller than the total dimension of keys and values across the $n_h$ heads of per-head dimension $d_h$.
  - Keys ($\mathbf{k}_t^C$) and values ($\mathbf{v}_t^C$) are reconstructed from $\mathbf{c}_t^{KV}$ using up-projection matrices $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$: $[\mathbf{k}_{t,1}^C; \dots; \mathbf{k}_{t,n_h}^C] = W^{UK} \mathbf{c}_t^{KV}$ and $[\mathbf{v}_{t,1}^C; \dots; \mathbf{v}_{t,n_h}^C] = W^{UV} \mathbf{c}_t^{KV}$ (see the first sketch after this list).
- Rotary Positional Embedding (RoPE): A decoupled key vector carrying positional information, $\mathbf{k}_t^R$, is generated with a separate projection matrix $W^{KR} \in \mathbb{R}^{d_h^R \times d}$ and is shared across all heads: $\mathbf{k}_t^R = \mathrm{RoPE}(W^{KR} \mathbf{h}_t)$
  - The final key for each head combines the compressed key ($\mathbf{k}_{t,i}^C$) with the positional embedding ($\mathbf{k}_t^R$): $\mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^C; \mathbf{k}_t^R]$
- Query Compression: Queries are also compressed to reduce activation memory during training. A latent query vector $\mathbf{c}_t^Q \in \mathbb{R}^{d_c'}$ is computed with a down-projection matrix $W^{DQ} \in \mathbb{R}^{d_c' \times d}$: $\mathbf{c}_t^Q = W^{DQ} \mathbf{h}_t$
  - The per-head queries $\mathbf{q}_{t,i}$ are then reconstructed analogously, using up-projection matrices and decoupled RoPE embeddings (see the second sketch below).
- Attention Output: The attention output for each head is computed with the standard attention mechanism: $\mathbf{o}_{t,i} = \sum_{j=1}^{t} \mathrm{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^\top \mathbf{k}_{j,i}}{\sqrt{d_h + d_h^R}}\right) \mathbf{v}_{j,i}^C$
  - The final output of the attention layer is obtained by concatenating all head outputs and applying an output projection matrix $W^O$: $\mathbf{u}_t = W^O [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \dots; \mathbf{o}_{t,n_h}]$ (see the third sketch below).
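
To make the shapes concrete, here is a minimal PyTorch sketch of the key/value compression path. The sizes ($d = 1024$, $n_h = 8$, $d_h = 64$, $d_c = 128$) are hypothetical, and bias-free `nn.Linear` layers stand in for the projection matrices; the point is that only $\mathbf{c}_t^{KV}$ needs to be cached per token.

```python
# Minimal sketch: latent KV compression and per-head reconstruction.
# All sizes are hypothetical; nn.Linear(bias=False) stands in for each projection matrix.
import torch
import torch.nn as nn

d, n_h, d_h, d_c = 1024, 8, 64, 128            # embedding dim, heads, head dim, KV latent dim (d_c << n_h * d_h)

W_DKV = nn.Linear(d, d_c, bias=False)          # down-projection W^{DKV}
W_UK  = nn.Linear(d_c, n_h * d_h, bias=False)  # up-projection W^{UK}
W_UV  = nn.Linear(d_c, n_h * d_h, bias=False)  # up-projection W^{UV}

h = torch.randn(2, 16, d)                      # hidden states h_t: (batch, seq_len, d)

c_kv = W_DKV(h)                                # c_t^{KV}: the only KV state that must be cached
k_c  = W_UK(c_kv).view(2, 16, n_h, d_h)        # per-head compressed keys   k_{t,i}^C
v_c  = W_UV(c_kv).view(2, 16, n_h, d_h)        # per-head compressed values v_{t,i}^C

print(c_kv.shape, k_c.shape, v_c.shape)        # cache: d_c = 128 values/token vs. 2 * n_h * d_h = 1024
```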
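The decoupled RoPE key and the compressed query path can be sketched in the same way. The `rope` helper below is a simplified half-split rotary implementation, and the names `W_UQ` and `W_QR` for the query up-projection and the per-head RoPE query projection are assumptions for illustration (the text above only says queries are reconstructed "analogously"); all sizes remain hypothetical.

```python
# Minimal sketch: decoupled RoPE key (shared across heads) and compressed queries.
# Simplified rotary helper; W_UQ / W_QR are illustrative names; sizes are hypothetical.
import torch
import torch.nn as nn

def rope(x, base=10000.0):
    # Simplified rotary embedding: rotate (first-half, second-half) feature pairs
    # by position-dependent angles. x: (..., seq_len, dim) with even dim.
    t, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)        # (half,)
    ang = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]     # (seq_len, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * ang.cos() - x2 * ang.sin(),
                      x1 * ang.sin() + x2 * ang.cos()], dim=-1)

d, n_h, d_h, d_h_r, d_c_q = 1024, 8, 64, 32, 256   # d_c_q plays the role of d_c'

W_KR = nn.Linear(d, d_h_r, bias=False)             # W^{KR}: one shared RoPE key per token
W_DQ = nn.Linear(d, d_c_q, bias=False)             # W^{DQ}: query down-projection
W_UQ = nn.Linear(d_c_q, n_h * d_h, bias=False)     # query up-projection (assumed name)
W_QR = nn.Linear(d_c_q, n_h * d_h_r, bias=False)   # per-head RoPE query projection (assumed name)

h   = torch.randn(2, 16, d)                                     # (batch, seq_len, d)
c_q = W_DQ(h)                                                   # c_t^Q
q_c = W_UQ(c_q).view(2, 16, n_h, d_h).transpose(1, 2)           # q_{t,i}^C -> (batch, n_h, seq, d_h)
q_r = rope(W_QR(c_q).view(2, 16, n_h, d_h_r).transpose(1, 2))   # q_{t,i}^R -> (batch, n_h, seq, d_h^R)
q   = torch.cat([q_c, q_r], dim=-1)                             # q_{t,i} = [q_{t,i}^C; q_{t,i}^R]

k_r = rope(W_KR(h))                                             # k_t^R: (batch, seq, d_h^R)
k_r = k_r[:, None, :, :].expand(-1, n_h, -1, -1)                # shared across every head
# The per-head key would be k_{t,i} = [k_{t,i}^C; k_t^R], with k^C from the previous sketch.
print(q.shape, k_r.shape)
```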
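Finally, a sketch of the per-head attention and output projection, with random tensors standing in for the real inputs: queries and keys carry the concatenated compressed and RoPE parts (dimension $d_h + d_h^R$), the logits are scaled by $\sqrt{d_h + d_h^R}$, a causal mask restricts each position to $j \le t$, and $W^O$ maps the concatenated head outputs back to dimension $d$.

```python
# Minimal sketch: per-head attention over the concatenated [compressed; RoPE] keys/queries,
# followed by the output projection W^O. Random tensors stand in for the real q, k, v.
import math
import torch
import torch.nn as nn

b, t, n_h, d_h, d_h_r, d = 2, 16, 8, 64, 32, 1024

q = torch.randn(b, n_h, t, d_h + d_h_r)    # q_{t,i} = [q_{t,i}^C; q_{t,i}^R]
k = torch.randn(b, n_h, t, d_h + d_h_r)    # k_{j,i} = [k_{j,i}^C; k_j^R]
v = torch.randn(b, n_h, t, d_h)            # v_{j,i}^C

W_O = nn.Linear(n_h * d_h, d, bias=False)  # output projection W^O

scores = q @ k.transpose(-2, -1) / math.sqrt(d_h + d_h_r)   # scaled by sqrt(d_h + d_h^R)
causal = torch.triu(torch.ones(t, t), diagonal=1).bool()
scores = scores.masked_fill(causal, float("-inf"))          # each position attends only to j <= t
attn = scores.softmax(dim=-1)                               # Softmax_j over past positions

o = attn @ v                                                # per-head outputs o_{t,i}
u = W_O(o.transpose(1, 2).reshape(b, t, n_h * d_h))         # u_t = W^O [o_{t,1}; ...; o_{t,n_h}]
print(u.shape)                                              # (b, t, d)
```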