On large learning rates in overparameterized models (Thoughts and directions)

Current deep models have billions of parameters, and their scale is only going to increase. It has been observed that these 'overparameterized' models operate at the threshold of stability, a regime coined the 'edge of stability' (EOS) in the literature.

The current optimization literature does not give us a clear picture of how these models behave, since most analyses either study gradient flow or prove convergence of gradient descent under the stable step-size condition h < 2/L (with L the smoothness constant, i.e., the largest eigenvalue of the Hessian).
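
To see where this threshold comes from, here is a minimal sketch (my own toy, not taken from any of the papers below): on a quadratic loss the gradient descent map is linear, and the iterates contract exactly when h < 2/L.

```python
# Minimal sketch: gradient descent on the quadratic f(x) = (L/2) x^2,
# whose update is x_{t+1} = (1 - h*L) x_t.
# The iterates contract iff |1 - h*L| < 1, i.e. h < 2/L; above that they diverge.

L_smooth = 4.0  # curvature / smoothness constant of the quadratic

def run_gd(h, x0=1.0, steps=30):
    x = x0
    for _ in range(steps):
        x = x - h * L_smooth * x   # gradient of (L/2) x^2 is L*x
    return abs(x)

print(run_gd(h=0.45))   # h < 2/L = 0.5  -> |x| shrinks toward 0
print(run_gd(h=0.55))   # h > 2/L        -> |x| blows up
```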

This necessitates understanding how gradient descent behaves at the edge of stability. From here on, I plan to review the recent important papers studying this phenomenon (a few of which I also had the opportunity to review at conferences) and share my opinions. Below is my current list of papers, which I will keep adding to as I find time:


1) Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability by Alex Damian, Eshaan Nichani, Jason D. Lee. (link)

2) Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult by Yuqing Wang, Zhenghao Xu, Tuo Zhao, Molei Tao. (link)

3) Beyond the Edge of Stability via Two-step Gradient Updates by Lei Chen, Joan Bruna. (link) (my personal favorite)

4) Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability by Jingfeng Wu, Vladimir Braverman, Jason D. Lee (link)


Further, in the future, I will also try to highlight through examples why studying EOS in linear models (papers 2 and 3) does not reflect what happens in practical networks, since linear models do not exhibit the catapulting behaviour that non-linear models do.
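
As a teaser, here is a minimal sketch of that contrast (my own toy example, not taken from papers 2 or 3): a loss that is quadratic in its parameter either decreases monotonically or diverges under gradient descent, whereas the scalar product model f(u, v) = u*v, a standard toy in the catapult literature, can see its loss spike and then recover once the weight norm, and with it the local sharpness, shrinks.

```python
# Minimal sketch of the contrast (my own toy, not from papers 2 or 3).
# "Linear" here means the loss is quadratic in the parameter: GD either
# converges monotonically (h < 2/L) or diverges; it never spikes and recovers.
# The two-parameter model f(u, v) = u*v is non-linear in its parameters:
# with a large step size the loss first grows, the weight norm (and hence the
# local sharpness, roughly u^2 + v^2) shrinks, and GD then settles -- a catapult.

def quadratic_loss_trajectory(h, w0=3.0, y=1.0, steps=15):
    """GD on (1/2)(w - y)^2; curvature is 1, so the threshold is h < 2."""
    w, losses = w0, []
    for _ in range(steps):
        losses.append(0.5 * (w - y) ** 2)
        w = w - h * (w - y)
    return losses

def uv_loss_trajectory(h, u0=3.0, v0=0.1, y=1.0, steps=15):
    """GD on (1/2)(u*v - y)^2, updating u and v with their own gradients."""
    u, v, losses = u0, v0, []
    for _ in range(steps):
        r = u * v - y
        losses.append(0.5 * r ** 2)
        u, v = u - h * r * v, v - h * r * u
    return losses

print([round(l, 3) for l in quadratic_loss_trajectory(h=1.5)])  # monotone decrease
print([round(l, 3) for l in uv_loss_trajectory(h=0.3)])         # spike, then convergence
```

With the values above, the quadratic trajectory shrinks monotonically, while the u*v trajectory climbs for a few steps before collapsing back toward zero loss.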