-
RMSProp is very similar to Adagrad: both algorithms use the history of squared gradients to rescale the initial learning rate on a per-coordinate basis.
-
Adagrad accumulates all past squared gradients in the state vector St (§11.7, eq. 11.7.5), which leads to a continuously growing St. With this design, the overall learning speed keeps decreasing over time, which may not be ideal for non-convex optimization problems.
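For reference, here is a minimal sketch of Adagrad's per-coordinate update on plain tensors (the function name, ε, and the hyperparameter values are illustrative, not from the text):

import torch

# Minimal Adagrad-style update: St only ever grows,
# so the effective per-coordinate step keeps shrinking.
def adagrad_step(param, grad, s, lr=0.1, eps=1e-6):
    s += grad ** 2                           # S_t = S_{t-1} + g_t^2
    param -= lr * grad / (s + eps).sqrt()    # per-coordinate scaling of the step
    return param, s

param, s = torch.zeros(3), torch.zeros(3)
param, s = adagrad_step(param, torch.tensor([1.0, 2.0, 3.0]), s)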
-
Rather than considering the entire gradient history like Adagrad, RMSProp puts more weight on recent gradients.
-
RMSProp adopts a leaky average when updating the state vector St, and introduces a hyperparameter γ to adjust the overall influence of past time steps.
-
St accumulates the squared gradients of the current and past steps (with decaying weights) to adjust the effective learning rate on a per-coordinate basis.
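A minimal sketch of the corresponding RMSProp update with the leaky average (again, the function name and hyperparameter values are illustrative):

import torch

# RMSProp-style update: a leaky average of squared gradients instead of a plain sum.
def rmsprop_step(param, grad, s, lr=0.01, gamma=0.9, eps=1e-6):
    s = gamma * s + (1 - gamma) * grad ** 2  # S_t = γ S_{t-1} + (1 - γ) g_t^2
    param -= lr * grad / (s + eps).sqrt()    # per-coordinate learning-rate adjustment
    return param, s

param, s = torch.zeros(3), torch.zeros(3)
param, s = rmsprop_step(param, torch.tensor([1.0, 2.0, 3.0]), s)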
-
The leaky average computation here is similar to that of momentum (§11.6), but with an extra normalization factor (1 - γ) in front, so that the contributions of all time steps sum to roughly 1 (i.e. (1 - γ)(1 + γ + γ^2 + γ^3 + ...) ≈ 1).
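A quick numerical check of this normalization in plain Python (the cutoff of 1000 steps is illustrative):

# The leaky-average weight on the gradient from k steps ago is (1 - γ) * γ^k;
# summed over many steps these weights approach 1 (a geometric series).
gamma = 0.9
weights = [(1 - gamma) * gamma ** k for k in range(1000)]
print(sum(weights))  # ≈ 1.0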
-
The coefficient γ controls how much of the history is effective when adjusting the per-coordinate scale: the larger γ is, the more past steps influence the learning rate adjustment.
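An illustrative script comparing the weights placed on the most recent gradients for a few values of γ:

# Larger γ spreads the weight over more past steps;
# smaller γ concentrates it on the most recent gradients.
for gamma in (0.5, 0.9, 0.99):
    w = [(1 - gamma) * gamma ** k for k in range(5)]
    print(f"gamma={gamma}: weights on the 5 most recent gradients = {[round(x, 4) for x in w]}")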
-
Implementation in PyTorch:
torch.optim.RMSprop(params, lr, alpha)  # alpha plays the role of γ
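A brief usage sketch of the built-in optimizer (the model, data, and hyperparameter values below are illustrative, not from the text):

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01, alpha=0.9)
loss_fn = torch.nn.MSELoss()

X, y = torch.randn(32, 10), torch.randn(32, 1)
for _ in range(5):
    optimizer.zero_grad()
    loss_fn(model(X), y).backward()
    optimizer.step()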