Chapter 1: The Entrance (Embeddings & Positional Encodings) Our story begins with raw text, the sentence "The car is blue," entering the machine. To be understood by the model, these words must shed their human form: they are converted into Embeddings, dense vector representations of meaning. However, meaning isn't enough; order matters too. So Positional Encodings are added to these vectors, giving each word a unique address in the sequence so the model knows that "The" comes before "car."
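To make this concrete, here is a minimal NumPy sketch of the entrance step. The toy vocabulary, the embedding size of 8, and the sinusoidal positional encoding (the scheme from the original Transformer paper; the video does not say which scheme it animates) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and sentence (illustrative only)
vocab = {"the": 0, "car": 1, "is": 2, "blue": 3}
tokens = ["the", "car", "is", "blue"]

d_model = 8                                      # embedding dimension (toy size)
embedding_table = rng.normal(size=(len(vocab), d_model))

# 1. Look up an embedding vector for each token
token_ids = np.array([vocab[t] for t in tokens])  # shape: (seq_len,)
embeddings = embedding_table[token_ids]           # shape: (seq_len, d_model)

# 2. Sinusoidal positional encodings give every position a unique "address"
def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(d_model)[None, :]               # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

x = embeddings + positional_encoding(len(tokens), d_model)
print(x.shape)   # (4, 8): one position-aware vector per word
```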

Chapter 2: The Three Personalities (Queries, Keys, and Values) As the vectors move deeper (0:00-0:02), they undergo a linear transformation. Each input vector is projected into three distinct roles, known as the Query, the Key, and the Value (a small numerical sketch follows the list below).

  • The Query is the representation a word uses to ask questions of the others (e.g., "What am I related to?").
  • The Key is the label or identifier each word offers so it can be matched against those Queries.
  • The Value is the actual content or substance of the word.
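Continuing the same toy setup, here is a minimal sketch of the three projections. The weight-matrix names `W_q`, `W_k`, `W_v` and all sizes are illustrative assumptions, and `x` is a random stand-in for the position-aware vectors from Chapter 1.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))   # stand-in for the position-aware vectors

# One learned weight matrix per role; every word gets all three representations
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = x @ W_q   # (seq_len, d_k)  "what am I looking for?"
K = x @ W_k   # (seq_len, d_k)  "what label do I offer to be matched against?"
V = x @ W_v   # (seq_len, d_k)  "what content do I actually carry?"
```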

Chapter 3: The Great Conversation (Self-Attention Calculation) Now, the core magic happens: Self-Attention. The Queries perform a dot-product matrix multiplication with the Keys of every word in the sentence (0:03). This is the model asking, "How much does the word 'blue' relate to the word 'car'?" The resulting raw scores are passed through a Softmax function (0:04), which normalizes each row of scores into weights between 0 and 1 that sum to 1. These are the Attention Coefficients (or weights). If "blue" has a high attention score for "car," the model learns that these two concepts are tightly linked in this specific context.
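A small sketch of the score-and-softmax step, using random stand-ins for the Queries and Keys. The division by the square root of the key dimension follows the standard scaled dot-product formulation and may or may not be shown in the animation.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.normal(size=(seq_len, d_k))   # stand-in Queries from Chapter 2
K = rng.normal(size=(seq_len, d_k))   # stand-in Keys from Chapter 2

# Raw compatibility scores: each Query dotted with every Key
scores = Q @ K.T / np.sqrt(d_k)       # (seq_len, seq_len)

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

attention_weights = softmax(scores)   # each row sums to 1: the Attention Coefficients
print(attention_weights.shape)        # (4, 4): how much word i attends to word j
```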

Chapter 4: Gathering Wisdom (Pointwise Multiplication) With the attention scores established, the model performs Pointwise Multiplications (0:06). It multiplies the Attention Coefficients by the Values, which effectively filters the information: it amplifies the Values of relevant words (high attention) and drowns out the Values of irrelevant words. The sum of these weighted Values creates a new context-aware vector for each word.
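The same step in code, with random stand-ins for the Attention Coefficients and Values. The matrix product and the explicit pointwise-multiply-then-sum view give identical results.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
V = rng.normal(size=(seq_len, d_k))                        # stand-in Values from Chapter 2
attention_weights = rng.random((seq_len, seq_len))
attention_weights /= attention_weights.sum(axis=1, keepdims=True)  # stand-in coefficients

# For word i: scale every Value row by its attention weight, then sum:
#   output_i = sum_j attention_weights[i, j] * V[j]
# The matrix product does this for all words at once.
context = attention_weights @ V                            # (seq_len, d_k)

# Equivalent pointwise-multiply-then-sum view for the first word
word0 = (attention_weights[0][:, None] * V).sum(axis=0)
assert np.allclose(word0, context[0])
```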

Chapter 5: Parallel Perspectives (Multi-Head Attention) But looking at the sentence from just one angle isn't enough. As the animation reveals (0:07-0:08), this entire process happens simultaneously in multiple "universes" called Self-Attention Heads. Head 1 might focus on the relationship between the noun and adjective ("car" and "blue"), while Head 2 focuses on the verb ("is"). The outputs of these distinct heads are then brought together via Concatenation (0:08). They are merged and projected linearly to form the Multi-Head Self-Attention block (0:09). The words now possess a rich, multi-dimensional understanding of their context.
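A compact sketch of two attention heads running independently, then being concatenated and linearly projected. The head count, the sizes, and the output matrix `W_o` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
x = rng.normal(size=(seq_len, d_model))   # stand-in position-aware inputs

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V                                     # (seq_len, d_head)

# Each head has its own projection matrices and runs independently ("in parallel")
heads = []
for _ in range(n_heads):
    W_q = rng.normal(size=(d_model, d_head))
    W_k = rng.normal(size=(d_model, d_head))
    W_v = rng.normal(size=(d_model, d_head))
    heads.append(attention_head(x, W_q, W_k, W_v))

# Concatenate the heads, then mix them with a final linear projection W_o
concat = np.concatenate(heads, axis=-1)                    # (seq_len, d_model)
W_o = rng.normal(size=(d_model, d_model))
multi_head_output = concat @ W_o                           # (seq_len, d_model)
```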

Chapter 6: Strengthening and Stabilizing (Add & Norm) The journey is arduous, so to prevent the signal from degrading, the model uses a Residual Connection (indicated by the arrows bypassing the block at 0:10). It adds the original input to the output of the attention block. This sum is then processed by Layer Normalization (Add & Normalize) to ensure the numbers remain stable and training is efficient.
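A minimal sketch of the residual connection followed by layer normalization. The learned scale and shift parameters of LayerNorm are omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))          # input to the attention block (stand-in)
attn_out = rng.normal(size=(seq_len, d_model))   # output of multi-head attention (stand-in)

def layer_norm(h, eps=1e-5):
    # Normalize each token's vector to zero mean and unit variance
    mean = h.mean(axis=-1, keepdims=True)
    std = h.std(axis=-1, keepdims=True)
    return (h - mean) / (std + eps)

# Residual connection: add the original input back, then normalize
out = layer_norm(x + attn_out)                   # (seq_len, d_model)
```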

Chapter 7: The Final Refinement (Feed Forward Network) Finally, the data passes through a Position-wise Feed-Forward Network (0:10-0:11). This comprises fully connected layers (weights and biases) that process each token independently to extract deeper features. Following one last Add & Normalize step (0:11), the transformation is complete. The simple inputs "The car is blue" have evolved into complex, context-aware vector representations, ready to be passed to the next layer or the decoder.
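A final sketch of the position-wise feed-forward network followed by the last Add & Normalize. The hidden size of 32 and the ReLU activation are assumptions, since the video does not name them.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32
h = rng.normal(size=(seq_len, d_model))          # output of the first Add & Norm (stand-in)

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def layer_norm(v, eps=1e-5):
    return (v - v.mean(axis=-1, keepdims=True)) / (v.std(axis=-1, keepdims=True) + eps)

# Two fully connected layers applied to every position independently
ffn = np.maximum(0, h @ W1 + b1) @ W2 + b2       # (seq_len, d_model)

encoder_output = layer_norm(h + ffn)             # final Add & Normalize
```

Because the feed-forward layers act on each token separately, the only place words exchange information is the attention step sketched earlier.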
