One-Dimensional Gradient Descent
- To first order, f(x+ϵ) can be approximated by the function value f(x) and the first derivative f'(x) at x:
  f(x+ϵ) ≈ f(x) + ϵ f'(x) .  (11.3.1)
- Let ϵ = - η f'(x) for some small η > 0. Plugging it into Eq. (11.3.1), we have
  f( x - η f'(x) ) ≈ f(x) - η [f'(x)]² ⪅ f(x)  ( ∵ η [f'(x)]² ≥ 0 ) .
- Hence, if we use
  x ← x - η f'(x)  (11.3.4)
  to iterate x, the value of the function f(x) might decline (see the sketch below).
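A minimal sketch of the update in Eq. (11.3.4): the snippet below iterates x ← x - η f'(x) on an assumed toy objective f(x) = x², with illustrative choices η = 0.2 and x₀ = 10 that are not taken from the text.

```python
# Sketch of 1D gradient descent, Eq. (11.3.4): x <- x - eta * f'(x).
# The objective f(x) = x**2 (so f'(x) = 2*x), eta, and x0 are assumptions
# made for illustration only.
def f(x):
    return x ** 2          # toy objective

def f_grad(x):
    return 2 * x           # its derivative f'(x)

def gd_1d(eta=0.2, x0=10.0, steps=10):
    x = x0
    for _ in range(steps):
        x = x - eta * f_grad(x)   # x <- x - eta * f'(x)
        # each step, f(x) shrinks, consistent with f(x - eta f'(x)) <~ f(x)
    return x, f(x)

print(gd_1d())   # x moves toward the minimizer 0 and f(x) declines
```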
Multivariate Gradient Descent
- Learning rate matters.
- Adjusting the learning rate can be complicated, especially in high-dimensional spaces.
- How do we pick a suitable learning rate? Use information from the second-order derivative (curvature) of the objective function f:
  gradient ∇f(x) changes slowly (low curvature) → a larger learning rate can be used.
  gradient ∇f(x) changes rapidly (high curvature) → use a smaller learning rate (see the sketch after this list).
- However, for a deep learning network with millions of parameters, computing the Hessian matrix is too expensive: it has O(d × d) entries, with d being the number of parameters.
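The curvature trade-off above can be illustrated with a small sketch: multivariate gradient descent x ← x - η ∇f(x) on an assumed quadratic f(x) = x₁² + 2x₂², run with two learning rates. The objective and the learning rates are illustrative assumptions, not taken from the text; the steeper x₂ direction only tolerates a smaller step size.

```python
# Sketch: multivariate gradient descent x <- x - eta * grad f(x) on an assumed
# toy quadratic f(x) = x1**2 + 2*x2**2 (gradient (2*x1, 4*x2)).
import numpy as np

def f(x):
    return x[0] ** 2 + 2 * x[1] ** 2          # curvature is larger along x2

def grad_f(x):
    return np.array([2 * x[0], 4 * x[1]])

def gd(eta, x0=(5.0, 5.0), steps=20):
    x = np.array(x0)
    for _ in range(steps):
        x = x - eta * grad_f(x)               # multivariate gradient step
    return f(x)

print(gd(eta=0.1))  # small enough for both directions: f(x) decreases steadily
print(gd(eta=0.6))  # too large for the steep x2 direction: the iterate diverges
# A second-order method would also need the d x d Hessian; with d in the
# millions this is too expensive, as noted above.
```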