Grokking: Generalization beyond over-fitting on small algorithmic datasets
The paper *Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets* is about the generalization of over-parameterized neural networks (NNs) such as the transformer. The authors investigate which factors help these massive NNs generalize, and how quickly. The results are interesting. Let's dive in:
Results
| Row # | Study | Fixed | Visualizations from the paper | Conclusion |
|---|---|---|---|---|
| 1 | How validation accuracy changes with the number of optimization steps, given a fixed training set size | The training set size | (figure) | It takes many more optimization steps for validation accuracy to reach the level of training accuracy, but it eventually gets there. The more interesting part is *how* it gets there, which is clear from the evidence in row 2. |
| 2 | How the loss changes over the course of optimization | Same as #1 | (figure) | Training and validation reach chance-level accuracy at around the same time. With more optimization steps, the training loss keeps decreasing while the validation loss shoots up. Then, between $10^5$ and $10^6$ steps, the magic really happens: the validation loss starts to go down and catches up with the training loss. |
| 3 | If the accuracy level to reach is fixed, how the fraction of training data determines the number of optimization steps required | The validation accuracy: 99% | (figure) | The training time required to reach 99% validation accuracy increases rapidly as the training data fraction decreases (read the graph along the decreasing x-axis). |
| 4 | How optimization methods affect generalization under a fixed compute budget | The optimization step budget: $10^5$ steps | (figure) | Optimization methods affect both how fast the model learns and how well it generalizes; the positive effect of weight decay is especially pronounced (a minimal setup sketch follows this table). |
| 5 | Are there functions that are harder to generalize? | The optimization step budget: $10^5$ steps | (figure) | Functions that are not symmetric require larger fractions of the training data to generalize. |
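As a concrete reference for rows 1-4, here is a minimal sketch of the kind of setup the paper studies: the full operation table for a binary operation modulo a prime, a train/validation split controlled by a training-data fraction, and an AdamW loop so that weight decay can be toggled. Note the assumptions: the paper trains a small decoder-only transformer, not the tiny embedding+MLP stand-in below, and the modulus, model sizes, and hyperparameters here are illustrative, not the authors' exact configuration.

```python
# Hedged sketch of a grokking-style experiment (assumptions noted above).
import torch
import torch.nn as nn

P = 97              # modulus for the operation a + b (mod P); illustrative choice
TRAIN_FRAC = 0.5    # fraction of the operation table used for training (the x-axis of row 3)

# Build the full operation table: every pair (a, b) with label (a + b) mod P.
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P

# Random train/validation split of the table.
perm = torch.randperm(len(pairs))
n_train = int(TRAIN_FRAC * len(pairs))
train_idx, val_idx = perm[:n_train], perm[n_train:]

# Tiny stand-in model: embed both operands, concatenate, predict the result class.
class TinyNet(nn.Module):
    def __init__(self, p=P, d=128):
        super().__init__()
        self.emb = nn.Embedding(p, d)
        self.mlp = nn.Sequential(nn.Linear(2 * d, 256), nn.ReLU(), nn.Linear(256, p))

    def forward(self, ab):
        e = self.emb(ab)               # (batch, 2, d)
        return self.mlp(e.flatten(1))  # (batch, p) logits

model = TinyNet()
# Weight decay is the intervention the paper found most helpful (row 4).
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        logits = model(pairs[idx])
        return (logits.argmax(-1) == labels[idx]).float().mean().item()

for step in range(10**5):  # the fixed step budget used in rows 4 and 5
    batch = train_idx[torch.randint(len(train_idx), (512,))]
    loss = loss_fn(model(pairs[batch]), labels[batch])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        print(step, "train acc", accuracy(train_idx), "val acc", accuracy(val_idx))
```

Logging train and validation accuracy against the step count, as in the last line, is what produces the curves in rows 1 and 2; sweeping `TRAIN_FRAC` produces the curve in row 3.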
These are interesting observations. But in light of the paper *Are Emergent Abilities of Large Language Models a Mirage?*, I wonder whether the first and second results would hold if the authors traded their nonlinear metric (accuracy) for a smoother, continuous one.
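To make that question concrete, here is a small sketch of what swapping the metric could look like: from the same model outputs, compute both the thresholded metric (exact-match accuracy) and a continuous one (mean cross-entropy). The function name and signature are mine, not from either paper; logging both quantities during a run like the one above would show whether the validation curve still looks like a sharp transition under the smoother metric.

```python
# Hedged sketch: two metrics computed from the same logits.
import torch
import torch.nn.functional as F

def eval_metrics(model, inputs, targets):
    """Return (exact-match accuracy, mean cross-entropy) on one evaluation set."""
    with torch.no_grad():
        logits = model(inputs)
        acc = (logits.argmax(-1) == targets).float().mean().item()  # nonlinear: moves only when the argmax flips
        ce = F.cross_entropy(logits, targets).item()                 # smooth: responds to gradual shifts in the distribution
    return acc, ce
```

Accuracy only changes when the top prediction flips to the correct answer, whereas cross-entropy tracks gradual shifts in the predicted distribution; that distinction is the core of the Mirage paper's argument.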
References
- [Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets](https://arxiv.org/pdf/2201.02177.pdf)
- [Are Emergent Abilities of Large Language Models a Mirage?](https://arxiv.org/pdf/2304.15004.pdf)