Gradient Descent

Faisal Qureshi
faisal.qureshi@ontariotechu.ca
http://www.vclab.ca

Lesson Plan

  • Minimizing loss and the need for numerical techniques
  • Gradient descent
    • Recipe
    • Update rule
  • Batch update
  • Mini-batch update
  • Stochastic (or online) gradient descent
  • Learning rate
    • Changing learning rate to achieve faster convergence
  • Newton's method
    • How to choose a step size?
  • Momentum

Regression

Consider data points $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots , (x^{(N)}, y^{(N)})$. Our goal is to learn a function $f(x)$ that returns (predicts) the value $y$ given an $x$.

Linear regression

We assume that a linear model of the form $y=f(x)=\theta_0 + \theta_1 x$ best describes our data.

How do we determine the degree of "fit" of our model?

Least squares error

A loss/cost/objective function measures the degree of fit of a model to given data.

A simple loss function sums the squared differences between the actual values $y^{(i)}$ and the predicted values $f(x^{(i)})$. This is called the least squares error.

$$ C(\theta_0, \theta_1) = \sum_{i=1}^N \left(y^{(i)} - f(x^{(i)}) \right)^2 $$

Our task is to find values for $\theta_0$ and $\theta_1$ (model parameters) to minimize the cost.

We often refer to the predicted value as $\hat{y}$. Specifically, $\hat{y}^{(i)} = f(x^{(i)})$.
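To make the cost concrete, here is a minimal NumPy sketch of this loss; the function and variable names (`least_squares_cost`, `theta0`, `theta1`) are our own choices, not fixed by the notes:

```python
import numpy as np

def least_squares_cost(theta0, theta1, x, y):
    """Least squares error: sum of squared residuals."""
    y_hat = theta0 + theta1 * x          # predictions f(x^{(i)})
    return np.sum((y - y_hat) ** 2)
```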

$\DeclareMathOperator*{\argmin}{arg\,min}$

Minimizing cost

$$(\theta_0,\theta_1) = \argmin_{(\theta_0,\theta_1)} \sum_{i=1}^N \left(y^{(i)} - f(x^{(i)}) \right)^2$$

This is a convex function. We can solve for $\theta_0$ and $\theta_1$ by setting $\frac{\partial C}{\partial \theta_0}=0$ and $\frac{\partial C}{\partial \theta_1}=0$.
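Because the cost is convex, setting both partial derivatives to zero yields a closed-form solution. A minimal sketch for the 1D line model above (the function name and use of NumPy are our choices):

```python
import numpy as np

def fit_line_closed_form(x, y):
    """Solve dC/dtheta0 = 0 and dC/dtheta1 = 0 for the line fit."""
    x_bar, y_bar = x.mean(), y.mean()
    theta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1
```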

The need for numerical techniques

In general, cost functions are not convex, and it is not possible to find a minimum analytically (and there are no guarantees of finding the global minimum).

Gradient descent

Gradient descent is a simple yet powerful method for training model parameters by minimizing the loss function. It is one of the simplest optimization methods and is also referred to as steepest descent.

Recipe

  1. Initialize model parameters randomly (in our case $\theta$)
  2. Compute gradient of the loss function
  3. Take a step in the direction of negative gradient (decreasing loss function), and update parameters
  4. Repeat steps 2 and 3 until the loss function cannot be decreased any further

Gradient

Let $\theta \in \mathbb{R}^D$ and let $f(\theta)$ be a scalar-valued function. The gradient of $f(\cdot)$ with respect to $\theta$ is:

$$ \nabla_{\theta} f(\theta) = \left[ \begin{array}{c} \frac{\partial f(\theta)}{\partial \theta_1} \\ \frac{\partial f(\theta)}{\partial \theta_2} \\ \frac{\partial f(\theta)}{\partial \theta_3} \\ \vdots \\ \frac{\partial f(\theta)}{\partial \theta_D} \end{array} \right] \in \mathbb{R}^D $$
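A finite-difference approximation is a handy way to sanity-check an analytic gradient. A sketch, assuming `f` takes a NumPy vector and returns a scalar:

```python
import numpy as np

def numerical_gradient(f, theta, h=1e-6):
    """Central-difference approximation of the gradient of f at theta."""
    grad = np.zeros_like(theta, dtype=float)
    for d in range(theta.size):
        e = np.zeros_like(theta, dtype=float)
        e[d] = h                                  # perturb one coordinate
        grad[d] = (f(theta + e) - f(theta - e)) / (2 * h)
    return grad
```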

Update rule

The gradient descent update rule is:

\begin{equation} \begin{split} \theta^{(k+1)} & = \theta^{(k)} - \eta \frac{\partial C}{\partial \theta} \\ & = \theta^{(k)} - \eta \nabla_{\theta} C \end{split} \end{equation}

$\eta$ is referred to as the learning rate. It controls the size of the step taken at each iteration.

Example: linear least squares fitting

For our 1D line-fitting example, the gradient descent update rule is $$\theta_0^{(k+1)} = \theta_0^{(k)} - \eta \frac{\partial C}{\partial \theta_0}$$ and $$\theta_1^{(k+1)} = \theta_1^{(k)} - \eta \frac{\partial C}{\partial \theta_1}$$

where $$ \frac{\partial C}{\partial \theta_0} = -2 \sum_{i=1}^N \left(y^{(i)} - f(x^{(i)}) \right) \mathrm{ \ and\ } \frac{\partial C}{\partial \theta_1} = -2 \sum_{i=1}^N x^{(i)} \left(y^{(i)} - f(x^{(i)}) \right) $$
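These two partials translate directly into code. A minimal sketch (the function name is ours):

```python
import numpy as np

def gradient_C(theta0, theta1, x, y):
    """Analytic partial derivatives of the least squares cost."""
    residual = y - (theta0 + theta1 * x)          # y^{(i)} - f(x^{(i)})
    dC_dtheta0 = -2.0 * np.sum(residual)
    dC_dtheta1 = -2.0 * np.sum(x * residual)
    return dC_dtheta0, dC_dtheta1
```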

Fit

Batch update

  • Sum or average updates across every example, then change the parameter values

Example

\begin{align} \theta_0^{(k+1)} &= \theta_0^{(k)} + 2 \eta \sum_{i=1}^N \left(y^{(i)} - f(x^{(i)}) \right) \\ \theta_1^{(k+1)} &= \theta_1^{(k)} + 2 \eta \sum_{i=1}^N x^{(i)} \left(y^{(i)} - f(x^{(i)}) \right) \end{align}
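A full-batch gradient descent loop for the line fit might look as follows; the learning rate, threshold, and iteration cap are illustrative values, not prescribed by the notes:

```python
import numpy as np

def batch_gradient_descent(x, y, eta=1e-4, epsilon=1e-3, max_iters=10_000):
    """Sum the update over all N examples, then change the parameters.
    Stops when the per-iteration decrease in the cost drops below epsilon."""
    theta0, theta1 = 0.0, 0.0
    cost = np.sum((y - (theta0 + theta1 * x)) ** 2)
    for _ in range(max_iters):
        residual = y - (theta0 + theta1 * x)
        theta0 += eta * 2.0 * np.sum(residual)        # -eta * dC/dtheta0
        theta1 += eta * 2.0 * np.sum(x * residual)    # -eta * dC/dtheta1
        new_cost = np.sum((y - (theta0 + theta1 * x)) ** 2)
        if cost - new_cost < epsilon:                 # declare convergence
            break
        cost = new_cost
    return theta0, theta1
```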

Mini-batch update

  • Sum or average updates across a subset of the examples, then change the parameter values
  • Examples in each batch are selected at random

Example

\begin{align} \theta_0^{(k+1)} &= \theta_0^{(k)} + 2 \eta \sum_{i=1}^{N_{\mathrm{batch}}} \left(y^{(i)} - f(x^{(i)}) \right) \\ \theta_1^{(k+1)} &= \theta_1^{(k)} + 2 \eta \sum_{i=1}^{N_{\mathrm{batch}}} x^{(i)} \left(y^{(i)} - f(x^{(i)}) \right) \end{align}
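One mini-batch step could be sketched like this, assuming the batch size `n_batch` does not exceed the dataset size:

```python
import numpy as np

def minibatch_step(theta0, theta1, x, y, eta=1e-3, n_batch=16, rng=None):
    """Sum the update over a randomly chosen subset of the examples."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(x.size, size=n_batch, replace=False)   # random batch
    residual = y[idx] - (theta0 + theta1 * x[idx])
    theta0 += eta * 2.0 * np.sum(residual)
    theta1 += eta * 2.0 * np.sum(x[idx] * residual)
    return theta0, theta1
```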

Stochastic or online gradient descent

  • Update parameter values for each training example in turn
  • This assumes that the samples are i.i.d. (independent and identically distributed)
  • This is particularly useful when dealing with very large datasets

Example

\begin{align} \theta_0^{(k+1)} &= \theta_0^{(k)} + 2 \eta \left(y^{(i)} - f(x^{(i)}) \right) \\ \theta_1^{(k+1)} &= \theta_1^{(k)} + 2 \eta \, x^{(i)} \left(y^{(i)} - f(x^{(i)}) \right) \end{align}

where $i$ indexes the current training example.
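A stochastic pass over the data updates the parameters once per example; shuffling each epoch reflects the i.i.d. assumption. A sketch with illustrative hyperparameters:

```python
import numpy as np

def sgd_epoch(theta0, theta1, x, y, eta=1e-3, rng=None):
    """One pass over the data, updating after every single example."""
    rng = rng or np.random.default_rng()
    for i in rng.permutation(x.size):                 # visit in random order
        residual = y[i] - (theta0 + theta1 * x[i])
        theta0 += eta * 2.0 * residual
        theta1 += eta * 2.0 * x[i] * residual
    return theta0, theta1
```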

The effects of using a subset of data to compute loss

Learning rate

$$ \theta^{(k+1)} = \theta^{(k)} - \eta \nabla_{ \theta}{C} $$
  • $\eta$ too large, and the system might not converge
  • $\eta$ too small, and the system might take too long to converge

For gradient descent, $C(\theta)$, i.e., the objective, should decrease after every step.

A common scheme is to stop iterating (i.e., declare convergence) when the decrease in $C(\theta)$ is less than some threshold $\epsilon$ (say $10^{-3}$) in one iteration.

General rule of thumb: If gradient descent isn't working, use a smaller $\eta$.

Too small or too large a learning rate

Changing learning rate mid-training

Reducing learning rate

Zigzag behavior

Newton's method

We have looked at gradient descent; now we look at another method that can be used to estimate the parameters $\theta$. Near a solution, this method converges much faster than gradient descent.

Problem: Given $f(\theta)$, find $\theta$ s.t. $f(\theta) = 0$

Observation

The gradient is the slope of the function. If the tangent line at $\theta^{(0)}$ crosses zero at $\theta^{(1)}$, then

$$ \begin{split} f'(\theta^{(0)}) &= \frac{f(\theta^{(0)})}{\Delta} \\ &= \frac{f(\theta^{(0)})}{\theta^{(0)} - \theta^{(1)}} \end{split} $$

Method

An iterative method to solve the above problem is:

  • Start with an initial value of $\theta$
  • Update $\theta$ as follows $$ \theta^{(1)} = \theta^{(0)} - \frac{f(\theta^{(0)})}{f'(\theta^{(0)})} $$
  • Repeat until $f(\theta) \approx 0$ or the maximum number of iterations has been reached.
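A minimal sketch of this recipe for the scalar case; `f` and its derivative `f_prime` are supplied by the caller, and the tolerance is an illustrative choice:

```python
def newton_root(f, f_prime, theta, tol=1e-8, max_iters=100):
    """Newton's method for solving f(theta) = 0."""
    for _ in range(max_iters):
        if abs(f(theta)) < tol:                # f(theta) is approximately 0
            break
        theta = theta - f(theta) / f_prime(theta)
    return theta

# Example: the positive root of theta**2 - 2 = 0, i.e., sqrt(2).
root = newton_root(lambda t: t**2 - 2.0, lambda t: 2.0 * t, theta=1.0)
```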

Using Newton's method to maximize (or minimize) a function

We are interested in maximizing or minimizing a function, e.g., we minimize the negative log likelihood to estimate the parameters.

The gradient is $0$ at the maxima, minima, and saddle points of a function, so we can apply Newton's method to $C'(\theta) = 0$. This gives the following update rule for maximizing or minimizing a function.

$$ \theta^{(1)} = \theta^{(0)} - \frac{C'(\theta^{(0)})}{C''(\theta^{(0)})} $$
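The same routine, applied to $C'(\theta) = 0$, finds a stationary point of $C$; a sketch assuming the caller supplies both derivatives:

```python
def newton_minimize(C_prime, C_double_prime, theta, tol=1e-8, max_iters=100):
    """Newton's method applied to C'(theta) = 0 to find a stationary point."""
    for _ in range(max_iters):
        step = C_prime(theta) / C_double_prime(theta)
        theta = theta - step
        if abs(step) < tol:                    # step size has become tiny
            break
    return theta
```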

Hessian

The Hessian matrix of $f(\cdot)$ w.r.t. $\theta$, written $\nabla_{\theta}^2 f(\theta)$ or simply $\mathbf{H}$, is a $D \times D$ matrix of partial derivatives.

\begin{equation*} \mathbf{H} = \left[ \begin{array}{cccc} \frac{\partial^2 f(\theta)}{\partial \theta_1^2} & \frac{\partial^2 f(\theta)}{\partial \theta_1 \partial \theta_2} & \cdots & \frac{\partial^2 f(\theta)}{\partial \theta_1 \partial \theta_D} \\ \frac{\partial^2 f(\theta)}{\partial \theta_2 \partial \theta_1} & \frac{\partial^2 f(\theta)}{\partial \theta_2^2} & \cdots & \frac{\partial^2 f(\theta)}{\partial \theta_2 \partial \theta_D} \\ \vdots & \vdots & & \vdots \\ \frac{\partial^2 f(\theta)}{\partial \theta_D \partial \theta_1} & \frac{\partial^2 f(\theta)}{\partial \theta_D \partial \theta_2} & \cdots & \frac{\partial^2 f(\theta)}{\partial \theta_D^2} \\ \end{array} \right] \end{equation*}

How to choose the step size?

The most basic second-order optimization algorithm is Newton's algorithm.

$$ \theta^{(k+1)} = \theta^{(k)} - \mathbf{H}_{k}^{-1} \mathbf{g}_k $$

$\mathbf{H}_{k}$ is the Hessian at step $k$, and $\mathbf{g}_k$ is the gradient at step $k$.
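One multivariate Newton update can be sketched as below; solving the linear system $\mathbf{H}_k \mathbf{d} = \mathbf{g}_k$ is generally preferred to forming the inverse explicitly:

```python
import numpy as np

def newton_step(theta, grad, hess):
    """One Newton update: theta - H^{-1} g, via a linear solve."""
    return theta - np.linalg.solve(hess, grad)
```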

Momentum

$$\Delta \theta^{(k+1)} = \alpha \Delta \theta^{(k)} + (1 - \alpha) (- \eta) (\nabla C(\theta))$$

The parameters are then updated as $\theta^{(k+1)} = \theta^{(k)} + \Delta \theta^{(k+1)}$; blending in the previous step damps the zigzag behavior seen with plain gradient descent.
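A sketch of this update as code; the `alpha` and `eta` defaults are illustrative:

```python
import numpy as np

def momentum_step(theta, delta_prev, grad, eta=1e-3, alpha=0.9):
    """Blend the previous step with the current gradient step, then move."""
    delta = alpha * delta_prev + (1.0 - alpha) * (-eta) * grad
    return theta + delta, delta
```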

Why momentum really works (interactive)

https://distill.pub/2017/momentum/

Summary

  • Gradient descent
    • Batch update
    • Online or stochastic
  • Learning rate
  • Momentum
  • Hessian
  • These methods work well in practice
  • All of these methods converge as long as the learning rate is sufficiently small
  • The speed of convergence differs greatly, however
  • Newton's method converges quadratically
    • Every iteration of Newton's method doubles the number of digits to which your solution is accurate, e.g., error goes from 0.01 to 0.0001 in one step
    • This only holds when already close to the solution
  • $\theta$ is initialized randomly