RMSProp is very similar to Adagrad. Both algorithms use the history of squared gradients to scale the learning rate on a per-coordinate basis.
Adagrad accumulates all past squared gradients in the state vector St (§11.7 eq. 11.7.5), so St grows continuously. With this design, the effective learning rate keeps decreasing over time. This property may not be ideal for a non-convex optimization problem, where the learning rate can become too small before a good solution is reached.
Rather than weighting the entire gradient history equally like Adagrad, RMSProp puts more weight on recent gradients.
RMSProp uses a leaky average when updating the state vector St, and introduces a hyperparameter γ to control the overall influence of past time steps.
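For reference, the two state updates can be written side by side. This is a sketch of the commonly used formulation, where s_t stands for the state vector St and g_t (a symbol assumed here) is the gradient at step t, squared element-wise:

```latex
\begin{aligned}
\text{Adagrad:}\quad & \mathbf{s}_t \leftarrow \mathbf{s}_{t-1} + \mathbf{g}_t^2 \\
\text{RMSProp:}\quad & \mathbf{s}_t \leftarrow \gamma\, \mathbf{s}_{t-1} + (1 - \gamma)\, \mathbf{g}_t^2
\end{aligned}
```

The only difference is the pair of weights γ and (1-γ): old contributions are shrunk at every step instead of being kept forever.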
St accumulates the squared gradient of the current and past steps (with decaying weights) to adjust the effective learning rate on a per-coordinate basis.
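A minimal sketch of one RMSProp step in plain Python may make the per-coordinate scaling concrete. The function names, the toy objective f(x) = 0.1*x1^2 + 2*x2^2, the values lr=0.4 and eps=1e-6, and the placement of eps inside the square root are all assumptions chosen for illustration, not something fixed by the text above.

```python
import math

def rmsprop_step(x, s, grad, lr=0.4, gamma=0.9, eps=1e-6):
    """Apply one RMSProp update to the parameter list x.

    s holds the leaky average of squared gradients, one entry per coordinate.
    """
    # Leaky average of squared gradients (the state vector St).
    new_s = [gamma * s_i + (1 - gamma) * g_i ** 2 for s_i, g_i in zip(s, grad)]
    # Each coordinate gets its own effective learning rate lr / sqrt(s + eps).
    new_x = [x_i - lr / math.sqrt(s_i + eps) * g_i
             for x_i, s_i, g_i in zip(x, new_s, grad)]
    return new_x, new_s

def grad_f(x):
    # Gradient of the toy objective f(x) = 0.1 * x[0]**2 + 2 * x[1]**2
    return [0.2 * x[0], 4.0 * x[1]]

x, s = [-5.0, -2.0], [0.0, 0.0]
for _ in range(20):
    x, s = rmsprop_step(x, s, grad_f(x))
print(x)  # both coordinates should end up close to the minimum at (0, 0)
```

The two coordinates have gradients of very different magnitude, yet dividing by sqrt(s) gives them steps of comparable size.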
The leaky average computation here is similar to that of momentum (§11.6), but with an extra normalization factor (1-γ) in front, so that the contributions of all time steps sum to roughly 1 (i.e. (1-γ)(1 + γ + γ^2 + γ^3 + ...) ≈ 1).
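The normalization can be checked numerically. The sketch below (the helper name leaky_weights is hypothetical) lists the weight given to the squared gradient from k steps ago and compares the partial sum with the closed form 1 - γ^n of the finite geometric series.

```python
def leaky_weights(gamma, num_steps):
    """Weight of the squared gradient from k steps ago, for k = 0..num_steps-1."""
    return [(1 - gamma) * gamma ** k for k in range(num_steps)]

gamma = 0.9
for n in (10, 50, 200):
    total = sum(leaky_weights(gamma, n))
    # Partial sum next to the closed form 1 - gamma**n
    print(n, round(total, 6), round(1 - gamma ** n, 6))
```

As n grows the total approaches 1, which is why St stays on the same scale as a single squared gradient instead of growing without bound.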
The coefficient γ controls how long the history stays effective when adjusting the per-coordinate scale. The larger γ is, the more past steps influence the learning rate adjustment.
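To illustrate the effect of γ, the sketch below feeds a single unit gradient followed by zero gradients into the leaky average and reports how much of the spike is still present in the state 20 steps later. The helper name and the choice of 20 steps are assumptions made for illustration.

```python
def state_after_spike(gamma, quiet_steps):
    """Leaky average after one unit gradient followed by zero gradients."""
    s = 0.0
    grads = [1.0] + [0.0] * quiet_steps
    for g in grads:
        s = gamma * s + (1 - gamma) * g * g
    return s

for gamma in (0.9, 0.99):
    peak = 1 - gamma  # value of s right after the spike
    remaining = state_after_spike(gamma, quiet_steps=20) / peak
    print(gamma, round(remaining, 3))
```

The remaining fraction equals γ^20, so with γ = 0.9 most of the spike has faded, while with γ = 0.99 most of it is still there. A common rule of thumb is that the average effectively covers about 1/(1-γ) past steps.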
Implementation in PyTorch (the alpha argument plays the role of γ):
torch.optim.RMSprop(params, lr, alpha)