-
For the most basic mini-batch gradient descent, the learning rate η is a single constant, shared by every parameter and fixed across iterations.
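Concretely, with θ_t the parameters at step t and g_t = ∇_θ L(θ_t) the mini-batch gradient (symbols introduced here just for reference), every coordinate takes the same step:

θ_{t+1} = θ_t − η · g_t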
-
Adagrad, which stands for Adaptive Gradient, lets the effective learning rate vary per parameter, adapting to both the iteration count and the sparsity of the features.
Note that all operations are applied coordinate-wise, i.e. separately for each parameter.
-
Adagrad adapts the learning rate by dividing it by the square root of the accumulated sum of squared past gradients.
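Written out per coordinate, with G_t the running sum of squared gradients and ε a small constant for numerical stability (these symbols follow the usual convention, they are not defined in the text above):

G_t = G_{t−1} + g_t ⊙ g_t
θ_{t+1} = θ_t − (η / (√G_t + ε)) ⊙ g_t

where ⊙ denotes element-wise multiplication.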
-
At every iteration, the effective learning rate decreases dynamically on a per-coordinate basis: the accumulated sum of squared past gradients can only grow as training proceeds, so the effective learning rate gradually becomes smaller.
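To make the per-coordinate bookkeeping explicit, here is a minimal NumPy sketch of the update (function and variable names are illustrative, not any library's API):

import numpy as np

def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-10):
    # All operations below are element-wise (per coordinate).
    accum += grad ** 2                           # running sum of squared gradients
    theta -= lr / (np.sqrt(accum) + eps) * grad  # shrinking effective learning rate
    return theta, accum

# Toy run: coordinate 0 gets a gradient every step, coordinate 1 only rarely.
theta, accum = np.zeros(2), np.zeros(2)
for t in range(100):
    grad = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    theta, accum = adagrad_step(theta, grad, accum)

print(0.1 / (np.sqrt(accum) + 1e-10))  # the rare coordinate keeps the larger effective lr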
-
Adagrad is particularly effective with sparse features: infrequently occurring features accumulate only a small sum of squared gradients, so their effective learning rate decays more slowly and they still receive meaningful updates on the rare steps when they do appear.
-
Implementation in PyTorch:
torch.optim.Adagrad(params, lr=0.01)
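A minimal usage sketch (the model, data, and hyperparameters here are placeholder choices, not prescribed by the text):

import torch

model = torch.nn.Linear(10, 1)                        # toy model
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)

x, y = torch.randn(32, 10), torch.randn(32, 1)        # dummy batch
for _ in range(100):
    optimizer.zero_grad()                             # clear old gradients
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                                   # compute gradients
    optimizer.step()                                  # per-coordinate Adagrad update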