Deepseek v3 MHA
  • Key and Value Compression: The input for the $t$-th token at an attention layer is denoted as $\mathbf{h}_t \in \mathbb{R}^d$, where $d$ is the embedding dimension.

  • A compressed latent vector for keys and values, $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$, is computed using a down-projection matrix $W^{DKV} \in \mathbb{R}^{d_c \times d}$: $\mathbf{c}_t^{KV} = W^{DKV} \mathbf{h}_t$

  • Here, $d_c (\ll d_h n_h)$ is the KV compression dimension, much smaller than the total dimension of keys and values.

  • Keys ($\mathbf{k}_t^C$) and values ($\mathbf{v}_t^C$) are reconstructed from $\mathbf{c}_t^{KV}$ using up-projection matrices $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$.

  • So $[\mathbf{k}_{t,1}^C; \dots; \mathbf{k}_{t,n_h}^C] = W^{UK} \mathbf{c}_t^{KV}$ and $[\mathbf{v}_{t,1}^C; \dots; \mathbf{v}_{t,n_h}^C] = W^{UV} \mathbf{c}_t^{KV}$

  • Rotary Positional Embedding (RoPE): A decoupled key vector carrying positional information, $\mathbf{k}_t^R$, is generated using a separate projection matrix $W^{KR} \in \mathbb{R}^{d_h^R \times d}$: $\mathbf{k}_t^R = \mathrm{RoPE}(W^{KR} \mathbf{h}_t)$

  • The final key for each head concatenates the compressed key with the positional key, which is shared across heads: $\mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^C; \mathbf{k}_t^R]$

  • Query Compression: Queries are also compressed to reduce activation memory during training. A latent query vector $\mathbf{c}_t^Q \in \mathbb{R}^{d_c'}$ is computed as $\mathbf{c}_t^Q = W^{DQ} \mathbf{h}_t$, where $d_c'$ is the query compression dimension.

  • Queries are reconstructed similarly, using up-projection matrices on $\mathbf{c}_t^Q$ together with decoupled RoPE query vectors, so that each head's query $\mathbf{q}_{t,i}$ concatenates a content part and a positional part.

  • Attention Output: The attention output for each head is computed using the standard attention mechanism: $\mathbf{o}_{t,i} = \sum_{j=1}^{t} \mathrm{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^\top \mathbf{k}_{j,i}}{\sqrt{d_h + d_h^R}}\right) \mathbf{v}_{j,i}$

  • The final output of the attention layer is obtained by concatenating all head outputs and applying an output projection matrix $W^O$: $\mathbf{u}_t = W^O [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \dots; \mathbf{o}_{t,n_h}]$ (a code sketch of the full path follows this list)
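
The PyTorch sketch below puts the bullets together in order: down-projection to $\mathbf{c}_t^{KV}$ and $\mathbf{c}_t^Q$, up-projection back to per-head keys, values, and queries, a decoupled RoPE key and query, and standard causal softmax attention with scale $\sqrt{d_h + d_h^R}$. It is a minimal illustration under the definitions above, not DeepSeek-V3's implementation; the class name `LatentAttention`, the helper `rope`, and all dimension arguments are assumptions chosen for readability.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply a rotary positional embedding over the last dimension of x.

    x: (batch, seq, heads, dim) with even dim; pairs (x_i, x_{i+dim/2}) are rotated."""
    b, t, h, d = x.shape
    half = d // 2
    pos = torch.arange(t, device=x.device, dtype=x.dtype)                        # (t,)
    inv_freq = base ** (-torch.arange(half, device=x.device, dtype=x.dtype) / half)
    angles = pos[:, None] * inv_freq[None, :]                                    # (t, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class LatentAttention(nn.Module):
    """Sketch of the latent-compressed attention path described above.

    Keys/values are reconstructed from a small latent c^KV, queries from c^Q;
    each key/query concatenates a content part with a decoupled RoPE part."""

    def __init__(self, d: int, n_h: int, d_h: int, d_c: int, d_cq: int, d_hr: int):
        super().__init__()
        self.n_h, self.d_h, self.d_hr = n_h, d_h, d_hr
        # Down-projections to the latent spaces: c^KV = W^DKV h, c^Q = W^DQ h
        self.W_DKV = nn.Linear(d, d_c, bias=False)
        self.W_DQ = nn.Linear(d, d_cq, bias=False)
        # Up-projections: keys/values from c^KV, queries from c^Q
        self.W_UK = nn.Linear(d_c, n_h * d_h, bias=False)
        self.W_UV = nn.Linear(d_c, n_h * d_h, bias=False)
        self.W_UQ = nn.Linear(d_cq, n_h * d_h, bias=False)
        # Decoupled RoPE projections: a shared key k^R and per-head query parts q^R
        self.W_KR = nn.Linear(d, d_hr, bias=False)
        self.W_QR = nn.Linear(d_cq, n_h * d_hr, bias=False)
        self.W_O = nn.Linear(n_h * d_h, d, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        b, t, _ = h.shape
        c_kv = self.W_DKV(h)                                        # (b, t, d_c)
        c_q = self.W_DQ(h)                                          # (b, t, d_c')

        k_c = self.W_UK(c_kv).view(b, t, self.n_h, self.d_h)        # content keys k^C
        v = self.W_UV(c_kv).view(b, t, self.n_h, self.d_h)          # values v^C
        q_c = self.W_UQ(c_q).view(b, t, self.n_h, self.d_h)         # content queries q^C

        k_r = rope(self.W_KR(h).view(b, t, 1, self.d_hr))           # positional key, one per token
        q_r = rope(self.W_QR(c_q).view(b, t, self.n_h, self.d_hr))  # positional queries, per head

        # Concatenate content and positional parts: k_{t,i} = [k^C_{t,i}; k^R_t]
        k = torch.cat([k_c, k_r.expand(-1, -1, self.n_h, -1)], dim=-1)
        q = torch.cat([q_c, q_r], dim=-1)

        # Standard causal attention with scale sqrt(d_h + d_h^R)
        q, k, v = (x.transpose(1, 2) for x in (q, k, v))             # (b, n_h, t, .)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_h + self.d_hr)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=h.device), 1)
        scores = scores.masked_fill(causal, float("-inf"))
        o = F.softmax(scores, dim=-1) @ v                            # (b, n_h, t, d_h)
        o = o.transpose(1, 2).reshape(b, t, self.n_h * self.d_h)     # [o_1; ...; o_{n_h}]
        return self.W_O(o)                                           # u_t = W^O [o_1; ...; o_{n_h}]
```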
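
A toy usage with made-up dimensions (again an assumption, not DeepSeek-V3's actual configuration). Because only $\mathbf{c}_t^{KV}$ and the shared $\mathbf{k}_t^R$ would need to be cached at inference time, the per-token cache cost scales with $d_c + d_h^R$ rather than $2\, d_h n_h$:

```python
attn = LatentAttention(d=512, n_h=8, d_h=64, d_c=64, d_cq=96, d_hr=16)
u = attn(torch.randn(2, 10, 512))   # batch of 2, sequence length 10
print(u.shape)                      # torch.Size([2, 10, 512])
# Cached per token in this toy setting: d_c + d_hr = 80 floats,
# versus 2 * n_h * d_h = 1024 floats for uncompressed keys and values.
```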