11.3 Gradient Descent

1. Intuition behind gradient descent

One-Dimensional Gradient Descent

  • To first order, f(x + ϵ) can be approximated by the function value f(x) and the first derivative f'(x) at x:

        f(x + ϵ) = f(x) + ϵ f'(x) + O(ϵ²).                  (11.3.1)

  • Let ϵ = -η f'(x) for some small η > 0. Plugging this into Eq. (11.3.1), we have

        f(x - η f'(x)) = f(x) - η [f'(x)]² + O(η² [f'(x)]²).

    If we use
        x ← x - η f'(x)                      (11.3.4)
    to iterate x, the value of f(x) might decline:
        f(x - η f'(x)) ≈ f(x) - η [f'(x)]² ⪅ f(x)      (since η [f'(x)]² ≥ 0).
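
  • A minimal runnable sketch of update (11.3.4) follows; the objective f(x) = x², the learning rate η = 0.2, and the starting point x = 10 are illustrative assumptions, not values from the notes.

```python
# 1D gradient descent sketch for Eq. (11.3.4): x <- x - eta * f'(x).
# The objective f(x) = x**2, the learning rate, and the starting point
# are illustrative assumptions.

def f(x):
    return x ** 2              # example objective

def f_grad(x):
    return 2 * x               # its derivative f'(x)

def gd_1d(eta=0.2, x=10.0, steps=10):
    for _ in range(steps):
        x = x - eta * f_grad(x)    # Eq. (11.3.4)
    return x

print(gd_1d())   # moves toward the minimizer x* = 0
```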

Multivariate Gradient Descent
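
  • In the multivariate case the update rule uses the gradient: x ← x - η ∇f(x). Below is a minimal sketch on an assumed quadratic objective (the objective f(x) = x₁² + 2x₂², the learning rate, and the starting point are illustrative assumptions).

```python
import numpy as np

# Multivariate gradient descent sketch: x <- x - eta * grad_f(x).
# The quadratic objective f(x) = x1**2 + 2*x2**2, the learning rate, and the
# starting point are illustrative assumptions.

def grad_f(x):
    return np.array([2 * x[0], 4 * x[1]])     # gradient of x1**2 + 2*x2**2

def gd_multivariate(eta=0.1, x0=(-5.0, -2.0), steps=50):
    x = np.array(x0)
    for _ in range(steps):
        x = x - eta * grad_f(x)                # same rule, now with the gradient
    return x

print(gd_multivariate())   # approaches the minimizer (0, 0)
```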

2. The choice of learning rate

  • The learning rate matters: too small and progress is slow, too large and the iterates can oscillate or even diverge (see the comparison below).

  • Adjusting the learning rate can be complicated, especially in high-dimensional spaces.
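
  • A small numerical comparison on f(x) = x² makes this concrete (all values are illustrative assumptions; for this objective the update is x ← (1 - 2η) x, so it diverges once η > 1).

```python
# Illustration of why the learning rate matters (all values are illustrative).
# For f(x) = x**2 the update is x <- (1 - 2*eta) * x, so it diverges once eta > 1.

def run_gd(eta, x=10.0, steps=10):
    for _ in range(steps):
        x = x - eta * 2 * x       # gradient of x**2 is 2x
    return x

print(run_gd(eta=0.05))   # too small: still far from 0 after 10 steps
print(run_gd(eta=0.2))    # reasonable: close to 0
print(run_gd(eta=1.1))    # too large: the iterates blow up
```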

Newton's Method

  • How do we pick a suitable learning rate? Use information from the second-order derivative (curvature) of the objective function f:
        the gradient ∇f(x) changes slowly (low curvature)  → a larger learning rate can be used;
        the gradient ∇f(x) changes quickly (high curvature) → a smaller learning rate is needed.
    Newton's method uses this curvature information through the update x ← x - [H_f(x)]⁻¹ ∇f(x), where H_f is the Hessian of f (see the sketch after this list).

  

  • For a deep learning network with millions of parameters, computing the Hessian is too expensive: it has O(d²) entries, where d is the number of parameters.
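
  • The sketch below applies the full Newton update to an assumed small quadratic objective f(x) = ½ xᵀAx (the matrix A and the starting point are illustrative assumptions; for a quadratic, a single Newton step reaches the minimizer).

```python
import numpy as np

# Newton's method sketch: x <- x - H^{-1} * grad_f(x).
# The quadratic objective f(x) = 0.5 * x^T A x with a fixed positive definite A
# is an illustrative assumption; its Hessian is the constant matrix A, so one
# Newton step lands exactly on the minimizer.

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def grad_f(x):
    return A @ x                   # gradient of 0.5 * x^T A x

def hessian_f(x):
    return A                       # Hessian is constant for a quadratic

def newton_step(x):
    return x - np.linalg.solve(hessian_f(x), grad_f(x))

x0 = np.array([5.0, -3.0])
print(newton_step(x0))             # [0. 0.] -- the minimizer, in one step
```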

Preconditioning

  • Instead of computing the full Hessian, one alternative is to approximate it by its diagonal entries only, with the update being:

        x ← x - η diag(H_f(x))⁻¹ ∇f(x).

  • Preconditioning with gradient descent has a similar effect to selecting a separate learning rate for each parameter dimension (see the sketch below).
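
  • A hedged sketch of this diagonal preconditioner on an assumed ill-conditioned quadratic (the objective f(x) = ½(x₁² + 100 x₂²), the learning rate, and the starting point are illustrative assumptions):

```python
import numpy as np

# Diagonal preconditioning sketch: x <- x - eta * diag(H)^{-1} * grad_f(x).
# The ill-conditioned quadratic f(x) = 0.5 * (x1**2 + 100 * x2**2), the learning
# rate, and the starting point are illustrative assumptions.

hess_diag = np.array([1.0, 100.0])             # diagonal of the Hessian of f

def grad_f(x):
    return hess_diag * x                        # gradient of 0.5 * sum(h_i * x_i**2)

def preconditioned_gd(eta=0.9, x0=(5.0, 5.0), steps=20):
    x = np.array(x0)
    for _ in range(steps):
        x = x - eta * grad_f(x) / hess_diag     # effective step eta / h_i per dimension
    return x

print(preconditioned_gd())   # both coordinates shrink at the same rate toward (0, 0)
```

    With plain gradient descent the same η = 0.9 would diverge along the steep coordinate (the contraction factor 1 - 0.9·100 = -89 has magnitude greater than 1), which is why the preconditioner acts like a per-dimension learning rate.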