Introductory Guide to At-Home Neural Network Training
# Background
An LLM is a state machine that aims to characterise and explain data by embedding it into a hyperspace, so that knowledge, or in the case of language modelling the next token, can be retrieved from the previous tokens.
To achieve this, in language modelling an LLM usually uses a loss function called "cross-entropy loss", which essentially accounts for the probability of the next token. The model is punished for having confidence in the wrong token and rewarded for having confidence in the correct token.
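For illustration, here is a minimal sketch of that loss in PyTorch (the framework and the toy shapes are my assumptions, not from this write-up):

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch of 2 sequences, 5 tokens each, vocabulary of 100.
logits = torch.randn(2, 5, 100)           # model scores for each vocab entry
targets = torch.randint(0, 100, (2, 5))   # the actual next tokens

# Cross-entropy rewards probability mass placed on the correct next token
# and penalises confidence placed on wrong tokens.
loss = F.cross_entropy(logits.view(-1, 100), targets.view(-1))
print(loss.item())  # a scalar; lower means the model explains the data better
```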
Programmatically, this is done through an optimizer optimising the state machine on the loss landscape. The smallest modification the machine can make is called the "step length", and each optimization it applies toward the goal is called a step.
To make one step, one batch of data has to be seen, and the neural network moves toward the minimum for that batch. Ideally the network would scan the entire loss landscape and find the coordinates in the hyperspace that best reduce perplexity, i.e. best explain the data; however, that is often not possible due to computational constraints.
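Concretely, one batch producing one step looks roughly like this (a sketch assuming a PyTorch model and optimizer; the layer and sizes are placeholders):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 100)                          # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

batch_x = torch.randn(32, 10)                             # one batch of data
batch_y = torch.randint(0, 100, (32,))

optimizer.zero_grad()
loss = F.cross_entropy(model(batch_x), batch_y)
loss.backward()    # gradient points toward this batch's minimum
optimizer.step()   # one step of the configured step length
```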
To circumvent this limitation, the concept of batch size was introduced. The idea is that, given a sample of size n drawn from a population of size m, the minima of the sample and of the population should be close to each other. This technique makes two assumptions: 1. the data is homogeneous (largely identical), and 2. homogeneous data share the same local minimum.
In practice, neither assumption usually holds. A batch containing a single record will not represent the total landscape of a batch of size m, and on unseen data, during backpropagation, the network will make unnecessary optimizations, moving toward minima of the small batch that don't necessarily exist in a large batch.
The above situation has led to the general observation that as the sample size n approaches m, the distance between the local minima in the hyperspace should approach 0. This means a higher batch size should allow a network to learn faster, more efficiently, and generalise better, since each step moves it more directly toward the optimal local minimum.
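Stated as a limit (the symbols here are my own shorthand): if $\theta^*_n$ is a local minimum found on a sample of size $n$ and $\theta^*_m$ the corresponding minimum on the full population $m$, the observation is

$$\lim_{n \to m} \left\lVert \theta^*_n - \theta^*_m \right\rVert = 0$$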
This observation can be tested empirically.
# Hypothesis
A model trained with a larger batch size will have a lower loss and higher quality in a given number of steps, and the end result will always be better.
# Specific Aim
To test the above hypothesis, experiments can be set up to compare the performance and accuracy of three models trained with different batch sizes.
Experiment variables need to be controlled: for example, dataset, model architecture, and optimizer/step length. All of these are easy to reproduce except step length, since step length is calculated from the batch size: the higher the batch size, the smaller the step length.
# Implementation
To achieve a controlled comparison, three experiments can be run. Note that all experiments were carried out using cosine warmup, and batch size here refers to the effective batch size: gpu_count * bs_per_device * gradient_accumulation_steps.
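A sketch of how that effective batch size arises in practice, using gradient accumulation (the values and toy model below are illustrative, not the actual setup):

```python
import torch
import torch.nn.functional as F

gpu_count, bs_per_device, grad_accum_steps = 2, 8, 4
effective_bs = gpu_count * bs_per_device * grad_accum_steps
print(effective_bs)  # 64

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# Gradient accumulation: sum gradients over several micro-batches,
# then take a single optimizer step, emulating a larger batch.
for _ in range(grad_accum_steps):
    x = torch.randn(bs_per_device, 10)
    y = torch.randint(0, 2, (bs_per_device,))
    loss = F.cross_entropy(model(x), y) / grad_accum_steps
    loss.backward()   # gradients accumulate in .grad
optimizer.step()      # one step for the whole effective batch
optimizer.zero_grad()
```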
## 1. Superconvergence Test
First, a model using the smaller batch size is trained with a cosine warmup scheduler. An LR is set that is not too high, but high enough to eventually destabilise the network. The expectation is that as the network warms up and approaches the designated LR, it eventually destabilises. It is therefore very important to have a long warmup phase in this experiment, in order to find out exactly when the network destabilised. Destabilisation is characterised by a rapid loss increase, as in the loss curve attached below. Note that this pattern is different from overfitting: in overfitting the curve doesn't shoot up at any single step, but an overall increasing trend appears in the validation curve. After this experiment, the LR x at destabilisation was recorded and the resulting models were discarded.
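A sketch of such a warmup schedule, under the assumption that "cosine warmup" means the LR rises along a half-cosine from 0 to the target value (the exact shape isn't specified above):

```python
import math

def cosine_warmup_lr(step: int, warmup_steps: int, target_lr: float) -> float:
    """LR rises from 0 to target_lr along a half-cosine over warmup_steps."""
    if step >= warmup_steps:
        return target_lr
    return target_lr * 0.5 * (1 - math.cos(math.pi * step / warmup_steps))

# With a long warmup, the step at which the loss starts spiking
# pinpoints the destabilising LR x.
for step in (0, 250, 500, 1000):
    print(step, cosine_warmup_lr(step, warmup_steps=1000, target_lr=3e-4))
```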
## 2. Test for a model with a small batch size
The small-batch model is trained with a cosine warmup scheduler and an LR below the unstable LR, so it will not reproduce the collapse. After this experiment, the LR y = 0.5x was used and recorded, and the model was saved for further testing.
## 3. Test for a model with a large batch size
The large-batch model should use an LR z = y*(newBS/oldBS)^0.5 to get the same step length as the small-batch model. The full rationale is not in the scope of this paper; it can be understood as: a larger batch size needs an LR of z in order to achieve the same step length as the small-batch model. And even if the final value z is larger than the destabilisation value x from experiment 1, this model (3) trained with the large batch size will not collapse, because its step length is equal to that of model 2. After this experiment, the LR z = y*(newBS/oldBS)^0.5 was recorded and the model saved for further analysis.
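The scaling rule can be written as a one-line helper (this is the standard square-root rule; the function name and example numbers are mine):

```python
def scaled_lr(y: float, old_bs: int, new_bs: int) -> float:
    """Square-root LR scaling: preserves the small-batch model's step length."""
    return y * (new_bs / old_bs) ** 0.5

# Example: LR y found safe at effective batch size 64, scaled for 1024.
z = scaled_lr(3e-4, old_bs=64, new_bs=1024)  # sqrt(16) = 4, so z = 1.2e-3
print(z)
```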
# Conclusion
The trained models were saved, and their loss curves now need to be compared.
![Attached loss curve for the destabilisation in experiment 1](https://media.discordapp.net/attachments/1112690728531918948/1155133341641736202/Screenshot_20230923-091406_Chrome.jpg?width=383&height=809)