Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets is about the generalization of over-parameterized neural networks (NNs) such as the transformer. The authors investigate the factors that determine how quickly these massive NNs generalize. The results are interesting. Let's dive in:

Results

| # | Study | Fixed | Conclusion |
|---|-------|-------|------------|
| 1 | How validation accuracy changes with optimization steps for a fixed training set size (a minimal code sketch of this setup follows the table) | Fraction of data used for training (50%); the task: the binary operation of division mod 97 | Validation accuracy takes many more optimization steps to reach the level of training accuracy, BUT it eventually gets there. The more interesting part is how it gets there, which is clear from row 2. |
| 2 | How the loss changes over optimization steps | Same as #1 | Training and validation leave chance-level accuracy around the same time. With more optimization steps, training loss keeps falling while validation loss shoots up. Then, between $10^5$ and $10^6$ steps, the magic happens: validation loss starts to go down and catches up with training loss. |
| 3 | With the target accuracy fixed, how the training data fraction determines the number of optimization steps required | Validation accuracy: 99% | The training time required to reach 99% validation accuracy increases rapidly as the training data fraction decreases (read the graph along the decreasing x-axis). |
| 4 | How optimization methods affect generalization under a fixed compute budget | Optimization step budget: $10^5$ steps | The choice of optimization method affects how quickly the network generalizes; the positive effect of weight decay is especially pronounced. |
| 5 | Are there operations that are harder to generalize? | Optimization step budget: $10^5$ steps | Operations that are not symmetric in their operands require larger fractions of training data to generalize. |
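To make the setup in rows 1, 2, and 4 concrete, here is a minimal sketch, assuming PyTorch, of how one might build the division-mod-97 dataset, split it 50/50, and train with AdamW (i.e., with decoupled weight decay) while logging train and validation accuracy at log-spaced steps. The model, hyperparameters, and names (`ModArithmeticNet`, the batch size, the weight-decay value) are my own illustrative choices, not the paper's exact configuration; the paper trains a small decoder-only transformer.

```python
import torch
import torch.nn as nn

P = 97  # modulus; the task is a / b (mod P), i.e. a * b^{-1} mod P

# Enumerate every equation "a / b = c" with b != 0.
pairs, labels = [], []
for a in range(P):
    for b in range(1, P):
        c = (a * pow(b, P - 2, P)) % P   # b^{-1} via Fermat's little theorem
        pairs.append((a, b))
        labels.append(c)
X = torch.tensor(pairs)                  # shape (P*(P-1), 2)
y = torch.tensor(labels)                 # shape (P*(P-1),)

# 50% train / 50% validation split, as in row 1.
perm = torch.randperm(len(X))
cut = len(X) // 2
train_idx, val_idx = perm[:cut], perm[cut:]

# Stand-in model: embed both operands, concatenate, classify the result mod P.
# (The paper uses a small transformer; an MLP keeps the sketch short.)
class ModArithmeticNet(nn.Module):
    def __init__(self, p=P, d=128):
        super().__init__()
        self.embed = nn.Embedding(p, d)
        self.mlp = nn.Sequential(nn.Linear(2 * d, 256), nn.ReLU(), nn.Linear(256, p))

    def forward(self, ab):
        e = self.embed(ab)               # (batch, 2, d)
        return self.mlp(e.flatten(1))    # (batch, p) logits

def accuracy(model, idx):
    with torch.no_grad():
        return (model(X[idx]).argmax(-1) == y[idx]).float().mean().item()

model = ModArithmeticNet()
# Weight decay is the knob row 4 highlights; 1.0 here is illustrative.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(1, 10**5 + 1):
    batch = train_idx[torch.randint(len(train_idx), (512,))]
    loss = loss_fn(model(X[batch]), y[batch])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if (step & (step - 1)) == 0:         # log at powers of two (roughly log-spaced)
        print(step, accuracy(model, train_idx), accuracy(model, val_idx))
```

Plotting the two accuracy curves against the (log-scaled) step axis is what produces the characteristic grokking picture: training accuracy saturates early, validation accuracy catches up only much later.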

These are interesting observations. But in light of the paper Are Emergent Abilities of Large Language Models a Mirage?, I wonder whether the first and second results would hold if the authors traded their nonlinear metric (accuracy) for a smoother, continuous one.
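A toy illustration of that concern, not taken from either paper: the sketch below assumes a hypothetical model whose per-token probability of being correct improves smoothly with training, and compares a discontinuous metric (exact match over a multi-token answer) with a continuous one (per-token log-probability). The discontinuous metric can look like a sudden jump even though the underlying quantity improves gradually.

```python
import numpy as np

steps = np.logspace(2, 6, 9)                 # optimization steps, log-spaced
# Hypothetical per-token probability of emitting the correct token,
# improving smoothly with training (a sigmoid in log-steps).
p_token = 0.01 + 0.98 / (1 + np.exp(-3 * (np.log10(steps) - 4.5)))

answer_len = 5                               # exact match needs all 5 tokens right
exact_match = p_token ** answer_len          # nonlinear metric: looks like a jump
mean_logprob = np.log(p_token)               # continuous metric: improves steadily

for s, em, lp in zip(steps, exact_match, mean_logprob):
    print(f"steps={s:9.0f}  exact_match={em:5.3f}  per-token logprob={lp:7.3f}")
```

The point is only that the apparent sharpness of a transition can depend on the metric, which is the central claim of the Mirage paper.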

References

  1. Power et al., Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. https://arxiv.org/pdf/2201.02177.pdf
  2. Schaeffer et al., Are Emergent Abilities of Large Language Models a Mirage? https://arxiv.org/pdf/2304.15004.pdf