Modern deep neural networks involve billions of parameters and appear to benefit greatly from scaling up. Many techniques that make it possible to train large models faster and better have been discovered empirically. I am interested in developing a phenomenological approach to deep learning that would explain quantitatively why some popular practices are effective, and guide the discovery and design of new architectures that can be trained better, faster, cheaper.