Adam (Adaptive Moment Estimation)


$$w_t = w_{t-1} - \eta \ \dfrac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$ **with:** $$\hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}$$ $$\hat{v}_t = \dfrac{v_t}{1 - \beta_2^t}$$ The hat terms apply bias correction: they compensate for the fact that the first and second moment estimates are initialized at zero and are therefore biased toward zero early in training (see the worked step below).

given: $$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\ \delta w_t$$ $$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\ \delta w_t^2$$ where $\delta w_t$ is the gradient of the loss with respect to the weights at step $t$.
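For example, at the first step ($m_0 = 0$) the raw estimate is scaled down by $(1 - \beta_1)$, and the correction exactly undoes this:

$$m_1 = \beta_1 \cdot 0 + (1 - \beta_1)\,\delta w_1 \quad\Rightarrow\quad \hat{m}_1 = \dfrac{m_1}{1 - \beta_1} = \delta w_1$$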

  • $m_t$ is the first moment estimate: an exponentially weighted average of past gradients, i.e., momentum.
  • $v_t$ is the second moment estimate: an exponentially weighted average of past squared gradients, which scales the step size per parameter as in AdaGrad / RMSProp.

Parameters:

  • $\eta$ is the learning rate
  • $\beta_1$ is the forgetting (exponential decay) rate for the first moment (typically 0.9)
  • $\beta_2$ is the forgetting (exponential decay) rate for the second moment (typically 0.999)
  • $\epsilon$ is a small constant for numerical stability (typically $10^{-8}$)
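
A minimal NumPy sketch of these updates, putting all the formulas above together (the function name `adam_step` and the toy quadratic objective are illustrative assumptions, not part of the original notes):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new weights and moment estimates."""
    # Update the biased first and second moment estimates.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # Bias correction: counteracts the zero initialization of m and v.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Parameter update.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage (assumption): minimize f(w) = w^2 from w = 5; the gradient is 2w.
w = np.array(5.0)
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 1001):  # t starts at 1 so the bias correction is defined
    grad = 2 * w
    w, m, v = adam_step(w, grad, m, v, t, lr=0.1)
print(w)  # close to 0
```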
