Understanding Gradient Descent on Edge of Stability in Deep Learning https://arxiv.org/abs/2205.09745

Per Zhiyuan Li:

If the learning rate (LR) is constant, there is no EoS: x(t) converges to 0 whenever LR < 2/(largest eigenvalue). In that case, x(t) aligns with the direction along which the power iteration converges the slowest, i.e., the eigenvector whose contraction factor |1 - LR*λ| is largest; depending on how the LR compares to the eigenvalues, this is the eigenvector of either the smallest or the largest eigenvalue. If LR > 2/(largest eigenvalue), GD diverges.
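A minimal numpy sketch of this quadratic picture (not code from the paper; the eigenvalues and learning rates are illustrative assumptions): GD on f(x) = ½ xᵀHx updates x(t+1) = (I - LR·H) x(t), so each eigencomponent shrinks by |1 - LR·λ| per step, the slowest-shrinking direction eventually dominates, and LR > 2/λ_max makes that factor exceed 1 and GD diverge.

```python
import numpy as np

# Sketch: GD on a quadratic f(x) = 1/2 x^T H x with a diagonal (hypothetical) H.
# Update: x_{t+1} = (I - lr * H) x_t, so the component along eigenvalue lam
# shrinks by |1 - lr * lam| per step. The iterate aligns with the eigendirection
# whose factor |1 - lr * lam| is largest (slowest power-iteration convergence).

eigvals = np.array([0.1, 1.0, 5.0])   # lam_min = 0.1, lam_max = 5.0 => 2/lam_max = 0.4
H = np.diag(eigvals)
x0 = np.random.default_rng(0).normal(size=3)

def run_gd(x0, lr, steps=200):
    x = x0.copy()
    for _ in range(steps):
        x = x - lr * (H @ x)           # gradient of 1/2 x^T H x is H x
    return x

# Small LR: |1 - lr*lam| is largest for lam_min, so x aligns with that eigenvector.
x_small = run_gd(x0, lr=0.05)
# LR just below 2/lam_max = 0.4: |1 - lr*lam_max| is now the largest factor,
# so x aligns with the top eigenvector instead.
x_near_limit = run_gd(x0, lr=0.395)

for name, xt in [("lr=0.05", x_small), ("lr=0.395", x_near_limit)]:
    direction = np.abs(xt) / np.linalg.norm(xt)
    print(name, "dominant eigen-coordinate:", np.argmax(direction),
          "||x|| =", np.linalg.norm(xt))
# Any lr > 0.4 gives |1 - lr*lam_max| > 1 and the iterate blows up (GD diverges).
```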

Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability https://arxiv.org/abs/2209.15594

Rethinking SGD’s Noise https://francisbach.com/rethinking-sgd-noise/

Rethinking SGD’s noise – II: Implicit Bias https://francisbach.com/implicit-bias-sgd/