The Impact of Depth and Width on Transformer Language Model Generalization
[llm, transformer, deep-learning, width, depth]
This is my reading note for The Impact of Depth and Width on Transformer Language Model Generalization. The paper shows that deeper transformers are needed for good performance, but most of the benefit comes from the first few layers; usually 4 to 6 layers is a good choice.
Introduction
We report three main conclusions:
- after fine-tuning, deeper models generalize better out-of-distribution than shallower models do, but the relative benefit of additional layers diminishes rapidly;
- within each family, deeper models show better language modeling performance, but returns are similarly diminishing;
- the benefits of depth for compositional generalization cannot be attributed solely to better performance on language modeling or on in-distribution data. (p. 1)
METHODOLOGY
CONSTRUCTING FAMILIES OF MODELS WITH EQUAL NUMBERS OF PARAMETERS
We can reduce the size of the feed-forward dimension d_ff, reduce the size of the residual stream (the embedding size) d_model, or reduce the size of the attention outputs d_attn (see Appendix B for a diagram of a transformer layer annotated with dimensionality labels). Vaswani et al. (2017) coupled these three variables at d_model = d_attn = d_ff/4. (p. 2)
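To make the parameter matching concrete, here is a minimal sketch of a per-layer parameter count in terms of d_model, d_attn, and d_ff, evaluated at the standard coupling. This is my own rough accounting, not the paper's: it ignores embeddings, biases, and layer norms, so the numbers are only illustrative.

```python
# Rough per-layer parameter count for a standard transformer block
# (hypothetical accounting: ignores embeddings, biases, and layer norms,
# so it will not match the paper's exact numbers).

def layer_params(d_model: int, d_attn: int, d_ff: int) -> int:
    attn = 4 * d_model * d_attn  # Q, K, V, and output projections
    ff = 2 * d_model * d_ff      # feed-forward up- and down-projections
    return attn + ff

# Standard coupling from Vaswani et al. (2017): d_model = d_attn = d_ff / 4.
d_model = d_attn = 512
d_ff = 4 * d_model
print(layer_params(d_model, d_attn, d_ff))  # 3,145,728 parameters per layer
```

With a formula like this, a family of equal-size models can be built by fixing the total budget and letting one of the width knobs absorb whatever a change in depth frees up or consumes.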
RESULTS
LANGUAGE MODELING
- While deeper models do, in general, perform better than shallower ones, the increase in performance that comes from adding layers diminishes rapidly as models become deeper (Figure 3a). (p. 4)
- At the deeper end of our scale, adding layers is not only unhelpful for performance, but begins to harm it (see the right-hand sides of each size-class curve in Figure 3a). (p. 5)
- We find that smaller models are more sensitive to the particular value of the feed-forward ratio, and that for small models the standard ratio may not be optimal. This shows that larger models have more leeway to trade depth for width, becoming wider in proportion to their model dimension d_model without incurring large penalties for their perplexity. It also shows that when d_model/d_ff < 1 the feed-forward ratio no longer serves as a predictor of relative perplexity independent of size. (p. 5) (A toy illustration of this depth-width trade-off is sketched below.)
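To see why deeper members of a fixed-size family end up with a narrower feed-forward block (and hence a larger d_model/d_ff ratio), here is a toy continuation of the same rough accounting. The budget, dimensions, and the dff_for_budget helper are my own illustrative choices, not the paper's configurations.

```python
def dff_for_budget(budget: float, n_layers: int, d_model: int, d_attn: int) -> int:
    """Choose d_ff so that n_layers * (4*d_model*d_attn + 2*d_model*d_ff) ~= budget."""
    per_layer = budget / n_layers
    return int((per_layer - 4 * d_model * d_attn) / (2 * d_model))

d_model = d_attn = 512
# Reference budget: a 6-layer model at the standard ratio d_ff = 4 * d_model.
budget = 6 * (4 * d_model * d_attn + 2 * d_model * 4 * d_model)

for n_layers in (2, 4, 6, 8, 12, 16):
    d_ff = dff_for_budget(budget, n_layers, d_model, d_attn)
    print(f"{n_layers:2d} layers: d_ff = {d_ff:5d}, d_model/d_ff = {d_model / d_ff:.2f}")
```

Under these assumptions, the 2-layer model gets d_ff = 8192 (ratio 0.06) while the 16-layer model is squeezed down to d_ff = 128 (ratio 4.00), which is the regime where the feed-forward ratio behaves differently.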
COMPOSITIONAL GENERALIZATION
- On each of the datasets, deeper models tend to attain higher generalization accuracies than shallower models in the same size class. (p. 6)
- As with language modeling, most of the benefit of depth is gained by having only a few layers. This supports the hypothesis that the saturating effect of depth is due to the existence of easier subsets of the datasets, and shows that increasing depth alone does substantially improve the models' ability to learn the correct inductive bias for these structural tasks. (p. 6)
THE EFFECT OF DEPTH ON GENERALIZATION IS NOT SOLELY ATTRIBUTABLE TO BETTER PRETRAINING LOSS OR IN-DISTRIBUTION PERFORMANCE
- Both of these observations are potential confounds for the interpretation of the previous section: perhaps depth does not directly improve generalization accuracy, but only does so indirectly by allowing models to either become better LMs or else to better learn the in-distribution fine-tuning data. (p. 7)
Written on October 31, 2023