Activation functions and empiricism
In the deep learning literature, there’s a veritable menagerie of activation functions commonly employed betwixt layers. Proponents of one class of functions or another will usually proffer some rationalisation for why their choice is well grounded: differentiability, smoothness, computational complexity, numerical stability, concision…
Today I was reading through “GLU Variants Improve Transformer” by Noam Shazeer (also of “Attention Is All You Need” fame) and came across this gem of empiricism:
“We have extended the GLU family of layers and proposed their use in Transformer. In a transfer-learning setup, the new variants seem to produce better perplexities for the de-noising objective used in pre-training, as well as better results on many downstream language-understanding tasks. These architectures are simple to implement, and have no apparent computational drawbacks. We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.”
Well, I appreciate the honesty.
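For what it's worth, the "simple to implement" part holds up. Below is a rough sketch of one gated variant from the paper, the SwiGLU feed-forward layer, which computes (Swish(xW) ⊗ xV)W2 with no bias terms. The dimensions, the random initialisation and the helper names (`matvec`, `random_matrix`) are my own illustrative assumptions, and pure-Python lists stand in for whatever tensor library you would really use.

```python
import math
import random

def swish(z):
    """Swish with beta = 1 (also known as SiLU): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]

def ffn_swiglu(x, W, V, W2):
    """Gated feed-forward layer: (Swish(xW) * xV) W2, without biases."""
    gate = [swish(z) for z in matvec(x, W)]         # nonlinear gate
    value = matvec(x, V)                            # plain linear projection
    hidden = [g * v for g, v in zip(gate, value)]   # element-wise product
    return matvec(hidden, W2)                       # project back to model width

def random_matrix(rows, cols, rng):
    """Hypothetical helper: small Gaussian weights, purely for illustration."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_ff = 4, 8   # assumed toy sizes, not taken from the paper
W = random_matrix(d_model, d_ff, rng)
V = random_matrix(d_model, d_ff, rng)
W2 = random_matrix(d_ff, d_model, rng)

x = [1.0, -0.5, 0.25, 2.0]
y = ffn_swiglu(x, W, V, W2)
print(len(y))  # 4: the output has the same width as the input
```

Swapping `swish` for a sigmoid, ReLU or GELU gives the other members of the family, which is rather the point: the variants differ by one function, and nobody claims to know why some do better.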