-
RMSProp is very similar to Adagrad: both algorithms use the history of squared gradients to rescale the initial learning rate on a per-coordinate basis.
-
Adagrad accumulates all past squared gradients in the state vector St (§11.7, eq. 11.7.5), which leads to a continuously growing St. With this design, the overall learning speed keeps decreasing over time, which may not be ideal for non-convex optimization problems.
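For reference, here is a minimal sketch of Adagrad's per-coordinate update on plain tensors (the function name, ε, and the hyperparameter values are illustrative, not from the text):

import torch

# Minimal Adagrad-style update: St only ever grows,
# so the effective per-coordinate step keeps shrinking.
def adagrad_step(param, grad, s, lr=0.1, eps=1e-6):
    s += grad ** 2                           # S_t = S_{t-1} + g_t^2
    param -= lr * grad / (s + eps).sqrt()    # per-coordinate scaling of the step
    return param, s

param, s = torch.zeros(3), torch.zeros(3)
param, s = adagrad_step(param, torch.tensor([1.0, 2.0, 3.0]), s)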
-
Rather than considering the entire gradient history like Adagrad, RMSProp puts more weight on recent gradients.
-
RMSProp adopts a leaky average when updating the state vector St, and introduces a hyperparameter γ to adjust the overall influence of past time steps.
-
St accumulates the squared gradients of the current and past steps (with decaying weights) to adjust the effective learning rate on a per-coordinate basis.
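A minimal sketch of the corresponding RMSProp update with the leaky average (again, the function name and hyperparameter values are illustrative):

import torch

# RMSProp-style update: a leaky average of squared gradients instead of a plain sum.
def rmsprop_step(param, grad, s, lr=0.01, gamma=0.9, eps=1e-6):
    s = gamma * s + (1 - gamma) * grad ** 2  # S_t = γ S_{t-1} + (1 - γ) g_t^2
    param -= lr * grad / (s + eps).sqrt()    # per-coordinate learning-rate adjustment
    return param, s

param, s = torch.zeros(3), torch.zeros(3)
param, s = rmsprop_step(param, torch.tensor([1.0, 2.0, 3.0]), s)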
-
The leaky average computation here is similar to that of momentum (§11.6), but with an extra normalization factor (1 - γ) in front, so that the contributions of all time steps sum to roughly 1 (i.e. (1 - γ)(1 + γ + γ^2 + γ^3 + ...) ≈ 1).
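A quick numerical check of this normalization in plain Python (the cutoff of 1000 steps is illustrative):

# The leaky-average weight on the gradient from k steps ago is (1 - γ) * γ^k;
# summed over many steps these weights approach 1 (a geometric series).
gamma = 0.9
weights = [(1 - gamma) * gamma ** k for k in range(1000)]
print(sum(weights))  # ≈ 1.0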
-
The coefficient γ controls how much of the history is effective when adjusting the per-coordinate scale: the larger γ is, the more past steps influence the learning rate adjustment.
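An illustrative script comparing the weights placed on the most recent gradients for a few values of γ:

# Larger γ spreads the weight over more past steps;
# smaller γ concentrates it on the most recent gradients.
for gamma in (0.5, 0.9, 0.99):
    w = [(1 - gamma) * gamma ** k for k in range(5)]
    print(f"gamma={gamma}: weights on the 5 most recent gradients = {[round(x, 4) for x in w]}")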
-
Implementation in PyTorch:
torch.optim.RMSprop(params, lr, alpha)  # alpha plays the role of γ
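A brief usage sketch of the built-in optimizer (the model, data, and hyperparameter values below are illustrative, not from the text):

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01, alpha=0.9)
loss_fn = torch.nn.MSELoss()

X, y = torch.randn(32, 10), torch.randn(32, 1)
for _ in range(5):
    optimizer.zero_grad()
    loss_fn(model(X), y).backward()
    optimizer.step()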