Gradient descent algorithms underlie most of the optimization in Machine Learning. In this notebook we will go through some simple examples of how gradient descent is implemented, and its application to linear regression.
%% Cell type:markdown id: tags:
## Contents
* Vanilla Gradient Descent
* Application to Linear Regression
* Exercises
%% Cell type:markdown id: tags:
## Vanilla Gradient Descent
%% Cell type:markdown id: tags:
Consider that we have $n$ independent features, denoted by $x_1, \dots, x_n$, and a function $V(x_1,\dots,x_n)$ which we want to minimize. If instead we have a function $W(x_1,\dots,x_n)$ that we want to maximize, then let $V = -W$, and minimizing $V$ is equivalent to maximizing $W$.
Since the name includes the word gradient, we first need to define the gradient. This is simply the vector of all first partial derivatives of the function $V$:
$$\nabla V = \left(\frac{\partial V}{\partial x_1}, \frac{\partial V}{\partial x_2}, \dots, \frac{\partial V}{\partial x_n}\right).$$
Recall that a partial derivative is simply the derivative with respect to one variable, with all other variables held constant.
The first observation we can make about a function we want to minimize is that the gradient at a point is the direction in which the function increases fastest. Hence, in the opposite direction the function will **decrease** the fastest. The most sensible direction in which to move to minimize the function is therefore opposite to the gradient, and consequently we can let
$$x_i \leftarrow x_i - \eta\,\frac{\partial V}{\partial x_i}, \qquad i = 1, \dots, n.$$
This forms the basis of the gradient descent method, and the parameter $\eta$ is called the learning rate.
The following short function implements gradient descent. It takes as input a function that calculates the gradient, a starting point, a learning rate, the maximum number of iterations, and a tolerance. Once the update steps become smaller than the tolerance, the algorithm returns.
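A minimal sketch of such a function, assuming a NumPy-based implementation (the names `gradient_descent`, `grad_fn`, `x0`, `eta`, `n_iter` and `tol` are illustrative):

%% Cell type:code id: tags:

```python
import numpy as np

def gradient_descent(grad_fn, x0, eta=0.1, n_iter=100, tol=1e-8):
    """Minimize a function by repeatedly stepping against its gradient.

    grad_fn : callable returning the gradient at a point
    x0      : starting point (scalar or array)
    eta     : learning rate
    n_iter  : maximum number of iterations
    tol     : return once the update step is smaller than this
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        step = eta * np.asarray(grad_fn(x))
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x
```

%% Cell type:markdown id: tags: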
To illustrate the method, we use as input the gradient of the quadratic function $V = x^2$. In this case the gradient is just the scalar $2x$. We do one iteration on each call to the gradient descent function, so that we can plot the evolution of the updates. In this case, for a small learning rate, the solution converges monotonically to the minimum at $x = 0$.
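Using the sketch above, the iteration loop might look like this (plotting omitted):

%% Cell type:code id: tags:

```python
# Gradient of V(x) = x**2 is 2x
grad_quadratic = lambda x: 2 * x

# One gradient-descent iteration per call, recording the path of updates
x, path = 2.0, [2.0]
for _ in range(20):
    x = float(gradient_descent(grad_quadratic, x, eta=0.1, n_iter=1))
    path.append(x)
print(path[:5])  # each step shrinks the distance to x = 0 by a factor 1 - 2*eta = 0.8
```

%% Cell type:markdown id: tags: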
If we make the learning rate larger, the iterates now oscillate between the two sides of the minimum, and convergence is much slower. For this quadratic the update is $x \leftarrow (1 - 2\eta)\,x$, so the iterates alternate in sign once $\eta > 1/2$; if the learning rate is larger than $1$, then $|1 - 2\eta| > 1$ and the iterations diverge.
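Both regimes can be checked with the same sketch: for $\eta = 0.9$ the iterates alternate in sign while slowly shrinking, and for $\eta = 1.1$ they grow without bound.

%% Cell type:code id: tags:

```python
# Update factor is 1 - 2*eta: -0.8 for eta = 0.9 (damped oscillation),
# -1.2 for eta = 1.1 (divergence)
for eta in (0.9, 1.1):
    x, trace = 2.0, []
    for _ in range(5):
        x = float(gradient_descent(grad_quadratic, x, eta=eta, n_iter=1))
        trace.append(round(x, 3))
    print(eta, trace)
```

%% Cell type:markdown id: tags: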
One problem that often occurs with gradient descent is that the solution can converge to a local minimum rather than the global one. For example, for the quartic function below we see that the solution converges to the local minimum at $x \approx 1$, whereas for a learning rate of half the given value the solution will converge to the global minimum.
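As an illustration of this trap, consider the hypothetical quartic $V(x) = x^4 - 2x^2 + \tfrac{1}{2}x$ (a stand-in for the function plotted above), which has a local minimum near $x \approx 0.9$ and a global minimum near $x \approx -1.05$:

%% Cell type:code id: tags:

```python
# Hypothetical quartic V(x) = x**4 - 2*x**2 + 0.5*x, with
# gradient V'(x) = 4*x**3 - 4*x + 0.5
grad_quartic = lambda x: 4 * x**3 - 4 * x + 0.5

# Starting at x = 2, the iterates roll into the nearest basin and
# settle in the local minimum rather than the global one
x_final = float(gradient_descent(grad_quartic, 2.0, eta=0.01, n_iter=1000))
print(x_final)  # ~0.93, the local minimum
```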