- Key and Value Compression: The input for the $t$-th token at an attention layer is denoted as $\mathbf{h}_t \in \mathbb{R}^d$, where $d$ is the embedding dimension.
  - A compressed latent vector for keys and values, $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$, is computed using a down-projection matrix $W^{DKV} \in \mathbb{R}^{d_c \times d}$: $\mathbf{c}_t^{KV} = W^{DKV} \mathbf{h}_t$
  - Here, $d_c$ ($\ll d_h n_h$) is the KV compression dimension, much smaller than the total dimension of keys and values across the $n_h$ heads of per-head dimension $d_h$.
  - Keys ($\mathbf{k}_t^C$) and values ($\mathbf{v}_t^C$) are reconstructed from $\mathbf{c}_t^{KV}$ using up-projection matrices $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$: $[\mathbf{k}_{t,1}^C; \dots; \mathbf{k}_{t,n_h}^C] = W^{UK} \mathbf{c}_t^{KV}$ and $[\mathbf{v}_{t,1}^C; \dots; \mathbf{v}_{t,n_h}^C] = W^{UV} \mathbf{c}_t^{KV}$ (see the first sketch after this list).
- Rotary Positional Embedding (RoPE): A decoupled key vector carrying positional information, $\mathbf{k}_t^R$, is generated with a separate projection matrix $W^{KR} \in \mathbb{R}^{d_h^R \times d}$ and is shared across all heads: $\mathbf{k}_t^R = \mathrm{RoPE}(W^{KR} \mathbf{h}_t)$
  - The final key for each head combines the compressed key ($\mathbf{k}_{t,i}^C$) with the positional embedding ($\mathbf{k}_t^R$): $\mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^C; \mathbf{k}_t^R]$
- Query Compression: Queries are also compressed to reduce activation memory during training. A latent query vector $\mathbf{c}_t^Q \in \mathbb{R}^{d_c'}$ is computed with a down-projection matrix $W^{DQ} \in \mathbb{R}^{d_c' \times d}$: $\mathbf{c}_t^Q = W^{DQ} \mathbf{h}_t$
  - The per-head queries $\mathbf{q}_{t,i}$ are then reconstructed analogously, using up-projection matrices and decoupled RoPE embeddings (see the second sketch below).
- Attention Output: The attention output for each head is computed with the standard attention mechanism: $\mathbf{o}_{t,i} = \sum_{j=1}^{t} \mathrm{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^\top \mathbf{k}_{j,i}}{\sqrt{d_h + d_h^R}}\right) \mathbf{v}_{j,i}^C$
  - The final output of the attention layer is obtained by concatenating all head outputs and applying an output projection matrix $W^O$: $\mathbf{u}_t = W^O [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \dots; \mathbf{o}_{t,n_h}]$ (see the third sketch below).
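
To make the shapes concrete, here is a minimal PyTorch sketch of the key/value compression path. The sizes ($d = 1024$, $n_h = 8$, $d_h = 64$, $d_c = 128$) are hypothetical, and bias-free `nn.Linear` layers stand in for the projection matrices; the point is that only $\mathbf{c}_t^{KV}$ needs to be cached per token.

```python
# Minimal sketch: latent KV compression and per-head reconstruction.
# All sizes are hypothetical; nn.Linear(bias=False) stands in for each projection matrix.
import torch
import torch.nn as nn

d, n_h, d_h, d_c = 1024, 8, 64, 128            # embedding dim, heads, head dim, KV latent dim (d_c << n_h * d_h)

W_DKV = nn.Linear(d, d_c, bias=False)          # down-projection W^{DKV}
W_UK  = nn.Linear(d_c, n_h * d_h, bias=False)  # up-projection W^{UK}
W_UV  = nn.Linear(d_c, n_h * d_h, bias=False)  # up-projection W^{UV}

h = torch.randn(2, 16, d)                      # hidden states h_t: (batch, seq_len, d)

c_kv = W_DKV(h)                                # c_t^{KV}: the only KV state that must be cached
k_c  = W_UK(c_kv).view(2, 16, n_h, d_h)        # per-head compressed keys   k_{t,i}^C
v_c  = W_UV(c_kv).view(2, 16, n_h, d_h)        # per-head compressed values v_{t,i}^C

print(c_kv.shape, k_c.shape, v_c.shape)        # cache: d_c = 128 values/token vs. 2 * n_h * d_h = 1024
```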
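The decoupled RoPE key and the compressed query path can be sketched in the same way. The `rope` helper below is a simplified half-split rotary implementation, and the names `W_UQ` and `W_QR` for the query up-projection and the per-head RoPE query projection are assumptions for illustration (the text above only says queries are reconstructed "analogously"); all sizes remain hypothetical.

```python
# Minimal sketch: decoupled RoPE key (shared across heads) and compressed queries.
# Simplified rotary helper; W_UQ / W_QR are illustrative names; sizes are hypothetical.
import torch
import torch.nn as nn

def rope(x, base=10000.0):
    # Simplified rotary embedding: rotate (first-half, second-half) feature pairs
    # by position-dependent angles. x: (..., seq_len, dim) with even dim.
    t, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)        # (half,)
    ang = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]     # (seq_len, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * ang.cos() - x2 * ang.sin(),
                      x1 * ang.sin() + x2 * ang.cos()], dim=-1)

d, n_h, d_h, d_h_r, d_c_q = 1024, 8, 64, 32, 256   # d_c_q plays the role of d_c'

W_KR = nn.Linear(d, d_h_r, bias=False)             # W^{KR}: one shared RoPE key per token
W_DQ = nn.Linear(d, d_c_q, bias=False)             # W^{DQ}: query down-projection
W_UQ = nn.Linear(d_c_q, n_h * d_h, bias=False)     # query up-projection (assumed name)
W_QR = nn.Linear(d_c_q, n_h * d_h_r, bias=False)   # per-head RoPE query projection (assumed name)

h   = torch.randn(2, 16, d)                                     # (batch, seq_len, d)
c_q = W_DQ(h)                                                   # c_t^Q
q_c = W_UQ(c_q).view(2, 16, n_h, d_h).transpose(1, 2)           # q_{t,i}^C -> (batch, n_h, seq, d_h)
q_r = rope(W_QR(c_q).view(2, 16, n_h, d_h_r).transpose(1, 2))   # q_{t,i}^R -> (batch, n_h, seq, d_h^R)
q   = torch.cat([q_c, q_r], dim=-1)                             # q_{t,i} = [q_{t,i}^C; q_{t,i}^R]

k_r = rope(W_KR(h))                                             # k_t^R: (batch, seq, d_h^R)
k_r = k_r[:, None, :, :].expand(-1, n_h, -1, -1)                # shared across every head
# The per-head key would be k_{t,i} = [k_{t,i}^C; k_t^R], with k^C from the previous sketch.
print(q.shape, k_r.shape)
```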
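Finally, a sketch of the per-head attention and output projection, with random tensors standing in for the real inputs: queries and keys carry the concatenated compressed and RoPE parts (dimension $d_h + d_h^R$), the logits are scaled by $\sqrt{d_h + d_h^R}$, a causal mask restricts each position to $j \le t$, and $W^O$ maps the concatenated head outputs back to dimension $d$.

```python
# Minimal sketch: per-head attention over the concatenated [compressed; RoPE] keys/queries,
# followed by the output projection W^O. Random tensors stand in for the real q, k, v.
import math
import torch
import torch.nn as nn

b, t, n_h, d_h, d_h_r, d = 2, 16, 8, 64, 32, 1024

q = torch.randn(b, n_h, t, d_h + d_h_r)    # q_{t,i} = [q_{t,i}^C; q_{t,i}^R]
k = torch.randn(b, n_h, t, d_h + d_h_r)    # k_{j,i} = [k_{j,i}^C; k_j^R]
v = torch.randn(b, n_h, t, d_h)            # v_{j,i}^C

W_O = nn.Linear(n_h * d_h, d, bias=False)  # output projection W^O

scores = q @ k.transpose(-2, -1) / math.sqrt(d_h + d_h_r)   # scaled by sqrt(d_h + d_h^R)
causal = torch.triu(torch.ones(t, t), diagonal=1).bool()
scores = scores.masked_fill(causal, float("-inf"))          # each position attends only to j <= t
attn = scores.softmax(dim=-1)                               # Softmax_j over past positions

o = attn @ v                                                # per-head outputs o_{t,i}
u = W_O(o.transpose(1, 2).reshape(b, t, n_h * d_h))         # u_t = W^O [o_{t,1}; ...; o_{t,n_h}]
print(u.shape)                                              # (b, t, d)
```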