- CNN: convolutional neural net: replace linear layers with convolutions
- GAN: generator and discriminator training in tandem
- SGD: stochastic gradient descent
- Supervised: training data is labeled (MNIST 0-9 digit labels)
- Unsupervised: training data is unlabeled (DCGAN bedrooms are just bedrooms)
- Parameters: learned weights and biases within NN layers
- Activations: transient computations calculated during a forward pass
- Probability distance function: a way to measure how far apart two probability distributions are (e.g. KL divergence, covered below)
Intuitions for KL Divergence
- Personally, I think the gambling analogies clicked the most for me
- If the house is mistaken in its belief that a game follows probability distribution $Q$ when you know it follows $P$, then $D_{KL}(P||Q)$ tells you how much money you could win using an optimal strategy to exploit this misbelief
KL Divergence intuitions video
- For 2 probability distributions $P$ and $Q$ over the same space of outcomes $|X|=n$, measure the likelihood of some event $E = x_1, x_2, x_3, \cdots, x_k$ under both distributions
    - $P_P(E) = \prod p_i$ and $P_Q(E) = \prod q_i$
- Quantify the distance $D(P||Q)$ from $P$ to $Q$ as the ratio of $P_P$ and $P_Q$ for any event
- Normalize by raising this ratio to the power $1/k$ and taking its log
- Using log rules and taking the limit as $k$ goes to infinity, this reduces to $\sum p_i\log(\frac{p_i}{q_i}) = D_{KL}(P||Q)$ (see the numeric sketch below)
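A minimal numeric sketch of that final formula (assuming NumPy; the two distributions here are made up for illustration):

```python
import numpy as np

# Two made-up distributions over the same 4 outcomes
p = np.array([0.50, 0.25, 0.15, 0.10])
q = np.array([0.25, 0.25, 0.25, 0.25])

# D_KL(P || Q) = sum_i p_i * log(p_i / q_i)
kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))

print(kl_pq, kl_qp)  # not symmetric: D_KL(P||Q) != D_KL(Q||P)
```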
- Classifier models usually output a probability distribution (softmax of logits), so we need a way to measure the distance of this probability distribution from the true distribution given by the training data
    - Input $x_i$ leads to prediction $P(y|x_i;\theta)$ (model parameters $\theta$) whereas the real distribution is $P^*(y|x_i)$
- Naturally you could use KL divergence to measure this distance, $D_{KL}(P^*||P)$
    - Using log rules you can separate the computation into a term that depends on $\theta$ and one that only depends on the true training data distribution
    - The true distribution is fixed so we can focus on just minimizing the term that depends on $\theta$
        - This term is $-\sum P^*(y|x_i)\log(P(y|x_i;\theta))$ which is the definition of cross entropy loss
- In classifiers the true distribution is often one-hot: all zeros with a single one at the true label of the example
    - In this case minimizing cross entropy loss reduces to maximum likelihood estimation for this dataset (see the sketch below)
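A small sketch (assuming NumPy) of how the one-hot case collapses to the negative log-likelihood of the true class:

```python
import numpy as np

def cross_entropy(p_true, p_model):
    # -sum_y P*(y|x) * log(P(y|x; theta))
    return -np.sum(p_true * np.log(p_model))

# Made-up softmax output over 4 classes and a one-hot true distribution
p_model = np.array([0.10, 0.70, 0.15, 0.05])
p_true = np.array([0.0, 1.0, 0.0, 0.0])  # the true label is class 1

# With a one-hot target only the true-class term survives, so the loss is
# just -log of the probability the model assigned to the true label
assert np.isclose(cross_entropy(p_true, p_model), -np.log(p_model[1]))
```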
- Non-linear layers are bad at learning the identity function and its ilk
- Short circuiting a layer allows that layer to approximate the identity by learning to set its weights to near-zero to let the input bypass through the identity branch
- Empirically this performs much better than without residuals
- Using projections with parameter vs straight identity produces no significant improvement
- In the ImageNet experiments they use batch norm, so vanishing gradients are unlikely (activations keep non-zero variance on the forward pass)
- As of paper publishing research is not quite sure why “plain nets” take so long to converge
- ResNets converge faster
- Belief is that identity shortcuts let optimizer solve some layers faster
- Bottlenecks
- Instead of short circuiting every pair of 3x3 conv create triplets of 1x1 conv reducing dim -> 3x3 conv -> 1x1 conv restoring dimension
- Short circuit the entire triplet
- Shows major improvement over bottlenecked “plain” nets
- Parametrized projections here hurt since they short circuit the higher dimensional space and thus introduce a lot of parameters
- Using bottlenecks allows for much deeper networks with fewer params (a 1x1 conv has far fewer params than a 3x3; see the block sketch below)
- 50-, 101-, and 152-layer ResNets perform way better
- At each layer the stddev of responses is lower than in plain nets, somewhat validating the hypothesis that ResNet non-linear functions tend to approach 0 and let the identity branch take over
- At ~1000 layers it still optimizes and works but performs slightly worse than the ~110-layer version due to overfitting
- Technique also worked for general object classification outside of ImageNet/CIFAR
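A minimal sketch of the basic and bottleneck blocks described above (assuming PyTorch; channel sizes are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convs with an identity shortcut around the pair."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # identity shortcut

class BottleneckBlock(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 restore, shortcut around the whole triplet."""
    def __init__(self, channels, reduced):
        super().__init__()
        self.reduce = nn.Conv2d(channels, reduced, 1, bias=False)
        self.conv = nn.Conv2d(reduced, reduced, 3, padding=1, bias=False)
        self.restore = nn.Conv2d(reduced, channels, 1, bias=False)

    def forward(self, x):
        out = torch.relu(self.reduce(x))
        out = torch.relu(self.conv(out))
        out = self.restore(out)
        return torch.relu(out + x)  # shortcut stays in the high-dimensional space

x = torch.randn(2, 64, 32, 32)
print(BasicBlock(64)(x).shape, BottleneckBlock(64, 16)(x).shape)
```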
- Internal Covariate Shift: the change in the distribution of network activations due to the change in network parameters during training
- Using sigmoid activation it's bad to fall into the "saturated regime" (long tail) since gradient of activation goes to 0
- Abstract: normalizing for each training mini-batch allows for higher learning rates and regularizes at least as well as dropout
- Training NN with mini-batches instead of on single examples is generally better
- More computationally efficient with parallel processing
- Mini-batches better represent the entire training set distribution (and its gradient) than a single example does
- Historically we've known that "whitening" (i.e. scaling for mean 0 var 1) inputs leads to faster training of an entire network
- We can model each layer of a NN as a separate NN whose input is the output of the previous layer
- Each SGD step across layers looks the same when you model it this way IF the distribution of the input does not change from layer to layer
- This is generally not true since the activation function will "saturate" in some dimensions of the input leading to vanishing gradients and slow convergence
- BatchNorm: core idea is to "whiten" inputs across layers not just at beginning of training
- When computing normalization factors independently of the gradient descent steps we find that the normalization adjustment can cancel out the gradient descent step adjustment leading to no reduction in loss
- Need to make the gradient descent be "aware" of the normalization
- Naively trying to account for this term is computationally expensive for statistics reason I don't feel like reading about
- Simulate full whitening between layers by scaling mean and var of each dimension independently
- Need to introduce transformation that allows normalization step to also be the identity function to effectively allow no normalization to take place
- Ex: normalizing inputs to sigmoid restricts sigmoid to linear section making the entire layer linear
- By-passing normalization can allow inputs to trigger non-linear zones of sigmoid if true non-linearity is needed
- Accomplished by introducing new scaling and shifting parameters $\gamma$ and $\beta$ that can learn to undo the variance scaling and mean shifting done during normalization (see the sketch at the end of this section)
    - Do the normalization across the mini-batch for each dimension of the input so these learned parameters can participate in the gradient descent step
- Add some $\epsilon$ to the var to stabilize it if var is near 0
- TODO: only got to section 3.1
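A rough sketch of the training-time batch norm forward pass described above (assuming NumPy; the running statistics used at inference time are omitted):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize each feature across the mini-batch,
    then let gamma/beta rescale and shift (they can learn to undo the normalization)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # per-feature mean 0, var ~1
    return gamma * x_hat + beta              # learned scale and shift

x = np.random.randn(32, 4) * 3.0 + 7.0       # a mini-batch with shifted, scaled features
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.var(axis=0).round(3))
```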
Random Aside: Dropout
- Deep NNs often overfit data and generalize poorly
- One solution is to use "ensemble" models that average the predictions across multiple models with different settings trained on the same data
- Approximate training multiple architectures in parallel by randomly "dropping" nodes during training
- Forces nodes to randomly take on different levels of responsibility on different inputs
- Without dropout sometimes later nodes can "co-adapt" to learn to correct mistakes from previous nodes
- Dropout also promotes learning sparse representations (useful for sparse autoencoders)
- Effectively thins the network during training so you need to start with a "wider" network
- Can be used on any layer
- For each layer specify a hyperparameter for the probability at which a node is kept (usually ~0.5 for hidden and 0.8-1 for visible)
- Need to rescale weights by the dropout probability since weights will increase to compensate for less connectivity
    - Can be done during training via "inverted dropout" (see the sketch below)
- Replaced older "regularization" techniques
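A minimal sketch of inverted dropout (assuming NumPy; `keep_prob` is the keep-probability hyperparameter mentioned above):

```python
import numpy as np

def dropout_forward(activations, keep_prob=0.5, training=True):
    """Randomly zero nodes during training and rescale by 1/keep_prob
    ("inverted dropout") so no extra rescaling is needed at test time."""
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) < keep_prob) / keep_prob
    return activations * mask

h = np.random.randn(8, 16)
print(dropout_forward(h).std(), dropout_forward(h, training=False).std())
```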
- Generators generally more complex than discriminators
- Discriminators just need to draw boundaries that divide the space, but generators must truly learn the distribution to create new points very close to real points
- Training convergence is generally unstable
- If generator gets really good then discriminator can do no better than 50/50 guess which means its feedback for generator is bad
- More training on bad feedback can degrade quality
- Conversely, early on in training the discriminator’s job is “easy” since generator outputs are horrible
- Naive generator loss $\log(1-D(G(z))) \approx \log(1-0) = 0$ when $D$ is very good and $G$ is very bad
- "Better" loss $\log(D(G(z)))$ is monotonic in the opposite direction, so maximizing it achieves the same result as minimizing the original loss, except now when $D(G(z))$ is near zero early in training $\log(D(G(z)))$ has very large magnitude (see the sketch below)
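A small sketch contrasting the two generator objectives (assuming NumPy; `d_of_gz` stands in for the discriminator's output on a fake sample):

```python
import numpy as np

# Discriminator outputs on fake samples, from "G is terrible" to "G fools D"
d_of_gz = np.array([0.001, 0.01, 0.1, 0.5, 0.9])

naive = np.log(1 - d_of_gz)        # to minimize: nearly flat (~0) when D(G(z)) is tiny
non_saturating = np.log(d_of_gz)   # to maximize: huge magnitude (strong signal) when D(G(z)) is tiny

for d, n, b in zip(d_of_gz, naive, non_saturating):
    print(f"D(G(z))={d:<5}  naive={n:+.4f}  non-saturating={b:+.4f}")
```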
- Wasserstein GANs:
- Let discriminator output any positive number instead of probability that sample is real/fake
- Goal is for discriminator to output generally larger numbers for real inputs than for fake inputs
- New loss: $D(x) - D(G(z))$
    - No need to take a log since these aren't probabilities
- Apparently leads to better training
- Requires weights to be “clipped” for theoretical math reasons (see the sketch below)
- Original Wasserstein GAN paper
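A rough sketch of the critic update with weight clipping (assuming PyTorch; `critic`, `gen`, and the clipping value 0.01 are placeholders, not the paper's exact setup):

```python
import torch
import torch.nn as nn

# Placeholder critic and generator just to make the update concrete
critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
gen = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

real = torch.randn(32, 2)                 # stand-in for a batch of real data
fake = gen(torch.randn(32, 8)).detach()   # generator samples, detached for the critic step

# Critic wants D(real) large and D(fake) small: maximize D(x) - D(G(z))
loss_c = -(critic(real).mean() - critic(fake).mean())
opt_c.zero_grad()
loss_c.backward()
opt_c.step()

# Weight clipping keeps the critic roughly Lipschitz-constrained
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-0.01, 0.01)
```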
- Failure modes:
- Vanishing gradient (discriminator wins)
- Fixes: Wasserstein, modified minimax (what we do in Streamlit)
- Mode collapse (generator only learns to produce 1 really good fake)
- Discriminator conversely learns to just reject that 1 sample
- Fixes: Wasserstein, unrolled GANs
- TODO unrolled paper
- Failure to converge
- Fixes: add noise to discriminator inputs, regularization
- TODO: regularization paper
- Variations
- Progressive GANs: early generator layers do low-res outputs to train more quickly
- Conditional GANs: specify which label the generator should try outputting
- Image to Image: input to generator is existing image and its goal is to make it look like ground truth image
- Loss is some combination of usual discriminator loss + pixel by pixel loss for diverging from ground truth
- CycleGAN: generator learns to make input images look like images from a ground truth set (horses to zebras) without labels
- Text-to-image: very limited
- Super resolution: learn to upscale an input image
- Inpainting: learn to generate blacked out parts of photo
- Text-to-speech
- Model architecture
- Replaced pooling techniques for down/up sampling with convolutions
- Replace fully connected pooling layers at start/end of conv layers with simple linear layers
- “Remove fully connected hidden layers for deeper architectures”
- Added batch norm to every conv layer
- Didn’t batch norm beginning/end of conv stack to avoid “sample oscillation and model instability”
- Replace old activation with ReLU
- Found LeakyReLU did well for discriminator
- TanH good for bounding output of generator
- Trained using mini-batch SGD with the Adam optimizer
- De-duped training set by training an auto-encoder on crops of the images to generate “semantic hashes”
- Also used some OpenCV to scrape faces off the internet to use as more training data
- “Walk the latent space” to look for discontinuities
- Smoothly interpolating the input random noise vector should produce smooth changes in output images with new features emerging
- Inputs that maximize the outputs of different filter layers help visualize what the layer has “learned” to recognize
- With DCGAN trained on bedrooms many filters’ best input show distinctly “bed” and “window” like features
- For a bunch of samples manually draw bounding boxes around generated windows
    - Train a logistic regression model on later generator conv layers (i.e. conv layers whose inputs are nearly full-dimensional images) to predict whether a filter would've been activated on that bounding-box region
        - Basically looking for filters that activate on subimages of windows
    - Dropping the filters that are predicted to activate on windows in the generator and feeding the same input showed outputs without windows
- Averaging latent vectors that produce output images of the “same type” gives good representations of those features
- Smiling woman - neutral woman + neutral man = smiling man vector
- Can interpolate between representation vectors too
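A toy sketch of the latent-vector arithmetic and interpolation described above (assuming NumPy; the `z_*` vectors are placeholders for latent codes averaged from samples of each type):

```python
import numpy as np

dim = 100
rng = np.random.default_rng(0)

# Placeholders for averaged latent vectors of each "type"
z_smiling_woman = rng.normal(size=dim)
z_neutral_woman = rng.normal(size=dim)
z_neutral_man = rng.normal(size=dim)

# Vector arithmetic: smiling woman - neutral woman + neutral man ~= smiling man
z_smiling_man = z_smiling_woman - z_neutral_woman + z_neutral_man

# Linear interpolation between two latent codes ("walking the latent space")
alphas = np.linspace(0, 1, 5)
walk = [(1 - a) * z_neutral_man + a * z_smiling_man for a in alphas]
print(len(walk), walk[0].shape)
```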
- Still has some model instability with mode collapse
- TODO
- TODO: go through streamlit again to jot down non-obvious stuff + read any papers I skipped
- https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73
- https://lilianweng.github.io/posts/2018-08-12-vae/
- TODO
- infinitesimal steps in gradient descent are mathematically correct but computationally infeasible
- nice memory footprint since $N$ parameters require $N$ gradients
- "curvature" of the gradient function is ignored by the partials in each direction
- can be addressed with a preconditioning matrix, but computing such a matrix is also intractable
- large batch sizes feel intuitively correct (closer to true gradient across dataset)
- empirically can cause problems
- start with small batches and increase size over time
- powers of 2 good for recursive computations, multiples of 32 good for CUDA
- Weight decay: at each iteration slightly scale weights down by $0.9999$ or something else close to 1
    - no mathematical proof it helps but empirically shows benefits
    - a type of inductive bias: helps us when it's justified and hurts us when it's not
    - generally applied only to weights and not biases; also not applied to BatchNorm parameters
- Momentum: update using a moving average of past gradients instead of just the current gradient (see the sketch below)
    - Pathological curves are hard to find global extrema on due to steep gradients toward local extrema in one dimension and shallow gradients toward the global extremum in another direction
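A compact sketch of an SGD step with momentum and weight decay as described above (assuming NumPy; the toy quadratic loss and hyperparameters are made up):

```python
import numpy as np

def sgd_step(w, grad, velocity, lr=0.01, momentum=0.9, weight_decay=1e-4):
    """One update: keep a moving average of past gradients (momentum) and
    slightly shrink the weights toward zero each step (weight decay)."""
    velocity = momentum * velocity + grad   # moving average of past gradients
    w = w - lr * velocity                   # step along the smoothed direction
    w = w * (1 - weight_decay)              # multiplicative shrink, e.g. ~0.9999
    return w, velocity

# Toy "pathological curvature": steep in one dimension, shallow in the other
def grad_fn(w):
    return np.array([20.0 * w[0], 0.2 * w[1]])

w, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    w, v = sgd_step(w, grad_fn(w), v)
print(w)  # heads toward the minimum at the origin
```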
- TODO: long list of tabs lol
- TODO
- General transformer architecture is as follows:
- tokens -> embeddings
- residual stream + attention heads
- residual stream + MLP layers
- embeddings -> tokens
- Residual Stream
- Lets transformer "remember" stuff from previous layers
- use logit lens to view predictions straight from stream before MLP layers and you'll see reasonable predictions
- Attention
- Generalizes convolution
- Attention head is learned algorithm for encoding "locality" of data within sequence
- Vision convolutions encode locality of pixel data but for text such locality is too nuanced and must be learned programmatically
- Generally broken up into 2 steps
- First learn sources and sinks for where information will move
- Function of both the source and destination tokens
- Linear transformation of input into "keys" (sources) and "queries" (destinations)
- Then learn what information to actually move
- Function of ONLY source tokens
- Linear transformation from inputs to "values"
- take a weighted average of the values, with weights given by the "query"/"key" scores
- "Autoregressive" = don't let attention look ahead
- MLP
- Generally uses GELU activation function
- Doesn't move information between positions
- does "reasoning" on features extracted by attention layers
- nobody is quite sure how to intuit what is actually going on in MLP layers
- Can be thought of as Key-Value pairs memorizing what each feature means in the abstract output space
- TODO Memory paper
- "MLPs as memory management"
- Embeddings
- Sometimes we used "tied embeddings" to reuse the same embedding matrix to un-embed output vectors into tokens
- Efficient to do but not actually a principled approach
- Imagine the transformer model was just a no-op
    - Going from token -> embedding -> unembedding -> token can then be modeled by the pure matrix multiplication $W_E W_U$ of the embedding and unembedding matrices
- Bi-grams like "Barack Obama" imply that the token "Barack" is way more likely to be followed by the token "Obama", but not necessarily that "Obama" is followed by "Barack"
    - Thus this transformation is not symmetric, but if $W_U=W^T_E$ then their product would be symmetric
- In practice the residual stream nudges the transformer towards the no-op direction, meaning this bi-gram asymmetry will emerge
    - So our embedding and unembedding transformations cannot be the same
- LayerNorm
- BatchNorm analogue for transformers: make each vector have mean 0 and var 1
    - Then scale/translate vectors to a desired (learned) mean/var
    - Technically non-linear, so it makes interpretability hard (see the sketch below)
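A quick sketch of that normalization (assuming NumPy; `gamma`/`beta` stand for the learned scale and shift):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """x: (seq, d_model). Normalize each position's vector along its own
    feature dimension, then apply the learned scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.randn(10, 16) * 5 + 2
out = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
print(out.mean(axis=-1).round(3)[:3], out.var(axis=-1).round(3)[:3])
```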
- Positional Embeddings
- Attention layers start out symmetric w.r.t. relative token index distances, but really they should care more about closer tokens than farther ones
- Can be addressed by learning a lookup map from token index to residual vector to add back to the embedding before applying attention head
- Residual vector has "local" shared memory that can help nudge embedding to carry information about nearby tokens
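A minimal sketch of a learned positional lookup added to the token embedding before the attention layers (assuming PyTorch; the sizes are placeholders):

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 1000, 128, 64
token_embed = nn.Embedding(vocab_size, d_model)
pos_embed = nn.Embedding(max_len, d_model)      # learned lookup: index -> residual vector

tokens = torch.randint(0, vocab_size, (2, 16))  # (batch, seq)
positions = torch.arange(tokens.shape[1])       # 0 .. seq-1
x = token_embed(tokens) + pos_embed(positions)  # position info added into the stream
print(x.shape)  # torch.Size([2, 16, 64])
```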
- interpretability of transformers better done "vertically" across attention layers rather than within them
- Each attention head has 4 matrices
    - $Q$ and $K$ of size `d_model x d_head` (they take embeddings to a "made up" head space)
        - For each index the "query" value tells you how much the "key" token at another index affects the prediction for the token at this index
        - For each index `i` compute a score based on the query and key of every other index `j` (using a dot product) that represents how much the transformer should "pay attention" to the token at `j` when predicting the token value at `i`
    - $V$ of size `d_model x d_out` and $O$ of size `d_out x d_model` (takes the "made up" output space back to embeddings)
        - $V$ produces a single "value" of size `d_out` for each token index in the sequence
        - For each index `i` of the sequence, the value vectors are then weighted by the scores from $Q^TK$ and summed
        - $O$ converts that summed value vector back into an embedding vector
    - in practice `d_head = d_out`
- Reformulate the full attention layer into $Q^TK$ and $OV$ circuits since the dimensions line up
    - technically inefficient to materialize, since the combined $Q^TK$ and $OV$ circuit matrices are each of size `d_model^2`
        - both would be low-rank (rank at most `d_head`) since generally `d_head << d_model`
    - Insert a softmax between $Q^TK$ and $OV$ (see the single-head sketch below)
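A single-head sketch of the pieces above (assuming NumPy; sizes are arbitrary, and the scores are computed with the factored matrices rather than materializing the full circuits):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_model, d_head, seq = 16, 4, 6
rng = np.random.default_rng(0)

# The head's 4 matrices (here d_out = d_head)
Q = rng.normal(size=(d_model, d_head))
K = rng.normal(size=(d_model, d_head))
V = rng.normal(size=(d_model, d_head))
O = rng.normal(size=(d_head, d_model))

x = rng.normal(size=(seq, d_model))   # residual stream: one vector per position

# QK circuit: score for how much destination index i attends to source index j
scores = (x @ Q) @ (x @ K).T                                # (seq, seq)
scores[np.triu(np.ones((seq, seq), bool), k=1)] = -np.inf   # autoregressive mask
pattern = softmax(scores / np.sqrt(d_head))

# OV circuit: what information actually moves, averaged by the attention pattern
head_out = (pattern @ (x @ V)) @ O   # (seq, d_model), added back to the stream
print(head_out.shape)
```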
- A full attention head's $Q^TK$ and $OV$ circuits can be thought of as executing 2 functions to generate data to append to each word of the sequence
    - inputs look like tuples of words along with their index in the sequence: `(word, i)`
    - the $Q^TK$ circuit can be thought of as a function that scores how much "attention" the token at index `i` pays to the token at index `j`
        - Signature: `def QK((query, query_idx), (key, key_idx)) -> number`
    - the $OV$ circuit is a function that takes the attention-weighted average of the tokens that index `i` paid attention to and spits out the token that represents that average
        - Signature: `def OV((val, val_idx)) -> token`
    - concat the token output of $OV$ to the end of the initial input to get $(word, i, a_1)$, the output of the first attention layer
        - Further attention layers will have access to previous layers through their inputs, so the $k$-th layer takes as input $(word, i, a_1, \cdots, a_{k-1})$ and outputs $(word, i, a_1, \cdots, a_k)$
    - I am using "token", "embedding", and "word" interchangeably here
- Thinking of the $Q^TK$ and $OV$ circuits as independent functions can help us understand what is happening across layers by imagining function composition
- An attention head that shifts the entire sequence forward by 1 word is relatively simple
    - $Q^TK$ returns 1 if the key (source) index is one step behind the query (destination) index and 0 otherwise
        - Looks like the identity matrix with the diagonal offset by 1 to the left
    - $OV$ simply spits back out its input token
        - Scaling each token in our vocab by the scaling factor from $Q^TK$ sends every token to 0 except the token that immediately preceded the current token, which is scaled by 1 back to itself
        - Summing all of these 0-vectors with the vector for the immediately preceding token gives $OV$ an input of the token that preceded the current index (see the toy sketch below)
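A toy sketch of this "copy the previous token" head, writing the attention pattern directly instead of learning it (assuming NumPy; one-hot vectors stand in for token embeddings):

```python
import numpy as np

vocab, seq = 5, 4
tokens = np.array([2, 0, 3, 1])
x = np.eye(vocab)[tokens]        # one-hot "embeddings", shape (seq, vocab)

# QK pattern: 1 where the key (source) index is one step behind the query (destination) index
pattern = np.eye(seq, k=-1)      # identity with the diagonal offset by 1

# OV circuit is the identity: just pass the attended token straight through
out = pattern @ x                # row i is the one-hot of token i-1 (row 0 attends to nothing)
print(out.argmax(axis=-1)[1:])   # [2, 0, 3]: the sequence shifted forward by one position
```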
- Induction heads recognize basic patterns and guess inductively that the next token follows the pattern
    - ex: `a b c d a b c <guess>` should guess `d`
    - This cannot be accomplished with a single attention layer since each $Q^TK$ evaluation can only be aware of a single pairwise comparison between 2 indices
        - If I'm considering index 5 (i.e. token `b`) and am computing $Q^TK$ against index 2 (token `c`), how could I possibly know whether or not this token matters?
        - I need to consider the whole sequence all at once to know whether an arbitrary token was part of a repeating (inductive) substring, and the pairwise attention computations from $Q^TK$ just aren't expressive enough for that
- Possible to do induction with 2 layers
    - layer 1:
        - $Q^TK$: returns 1 if the key (source) index is one step behind the query (destination) index and 0 otherwise
        - $OV$: return the input token
            - this is the same as the "shift 1 forward" attention head from above
        - The output for index `i` looks like $(word, i, \text{prev word})$
    - layer 2:
        - $Q^TK$: returns 1 if the query (destination) TOKEN is equal to the previous-word memory that layer 1 attached at the key (source) index, and 0 otherwise
        - $OV$: return the key index's own word
    - In English, composing these 2 attention layers does the following:
        - layer 1: for every word $w$ at index $i$ in my sequence, remember what word came before me as $p_i$
        - layer 2: for every word $w$ at index $i$ in my sequence, if $w$ is equal to the remembered previous word $p_j$ for some other index $j$, then return the word $w_j$ at index $j$ (the word that followed the earlier occurrence of $w$)
        - This effectively "remembers" bigram pairs
            - if at index $k$ I remembered the previous word $w_{k-1}$, then when I'm at index $i$ and notice the word $w_i$ is equal to that $w_{k-1}$, I'll output whatever originally came at index $k$ again
- Multiple layers of attention can implement arbitrarily complex functions this way, and interpreting what they do is at the heart of mech interp
- When an attention layer adds information to the residual stream that is used by the next layer in the "query" argument of $Q^TK$, it's called Q-Composition
    - the information the first layer adds is generally only read from the "query" argument
- Conversely, if the information is used by reading from the "key" argument of $Q^TK$, then it's called K-Composition
    - the bigram induction head above uses K-composition since we read the augmented information off the "key" argument
    - it can be done with Q-composition by instead:
        - Layer 1: for each index $i$, if its word $w_i$ is equal to another word $w_j$ at index $j$, then augment the stream with index $j$ (i.e. return $(word, i, j)$)
        - Layer 2: for each index $i$, return the word at the index that comes after the index you remembered from the previous layer (i.e. return $(word, i, j, w_{j+1})$)
- V-Composition is a little weirder
    - is a "virtual" phenomenon where behaviors achieved by separate attention heads combine to reproduce behavior attainable by a single attention head
    - is useful for the network as a whole since it can build a small set of base "instructions" out of its limited attention parameters and combine them to build more complex programs
        - analogous to the CPU ISA tradeoff of instruction count vs complexity of programs
- In practice the circuits don't actually "extend" the residual stream by lengthening the tuple
- Instead just adding the circuit outputs directly to the vectors in the stream works
- The overall network is linear if you squint hard enough so you can still interpret each attention layer as acting linearly over their previous outputs
- something something superposition
- feels very quantum mechanics-y
- the residual stream is your quantum state, and attention heads update it by adjusting its superposition over the basis of observable outcomes (the sequence of embeddings)
- quantum operators are all linear so you can think of them moving around the superposition independently and adding up their independent movements