Faisal Qureshi
faisal.qureshi@ontariotechu.ca
http://www.vclab.ca
Consider data points $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots , (x^{(N)}, y^{(N)})$. Our goal is to learn a function $f(x)$ that returns (predicts) the value $y$ for a given $x$.
We assume that a linear model of the form $y=f(x)=\theta_0 + \theta_1 x$ best describes our data.
How do we determine the degree of "fit" of our model?
Loss/cost/objective function measures the degree of fit of a model to a given data.
A simple loss function is to sum the squared differences between the actual values $y^{(i)}$ and the predicted values $f(x^{(i)})$. This is called the least squares error.
$$ C(\theta_0, \theta_1) = \sum_{i=1}^N \left(y^{(i)} - f(x^{(i)}) \right)^2 $$Our task is to find values for $\theta_0$ and $\theta_1$ (model parameters) to minimize the cost.
We often refer to the predicted value as $\hat{y}$. Specifically, $\hat{y}^{(i)} = f(x^{(i)})$.
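As a concrete sketch of the cost above (assuming NumPy and made-up data points that lie near the line $y = 1 + 2x$):

```python
import numpy as np

# Hypothetical data points (x_i, y_i); any paired arrays work.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.9, 5.1, 7.0])

def cost(theta0, theta1):
    """Least squares error C(theta0, theta1) = sum_i (y_i - f(x_i))^2."""
    y_hat = theta0 + theta1 * x   # predictions f(x_i)
    return np.sum((y - y_hat) ** 2)

print(cost(1.0, 2.0))  # cost of the model y = 1 + 2x on this data (small)
print(cost(0.0, 0.0))  # cost of a poor model (large)
```

A better-fitting line yields a smaller cost, which is exactly what we exploit when we minimize $C$.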
$\DeclareMathOperator*{\argmin}{arg\,min}$
This is a convex function. We can solve for $\theta_0$ and $\theta_1$ by setting $\frac{\partial C}{\partial \theta_0}=0$ and $\frac{\partial C}{\partial \theta_1}=0$.
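Carrying out that calculation, setting the two partial derivatives to zero gives closed-form estimates for the slope and intercept. A minimal sketch (NumPy, illustrative data):

```python
import numpy as np

# Illustrative data; points lie near the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.9, 5.1, 7.0])

# Solving dC/dtheta0 = 0 and dC/dtheta1 = 0 yields:
#   theta1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   theta0 = mean(y) - theta1 * mean(x)
x_bar, y_bar = x.mean(), y.mean()
theta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
theta0 = y_bar - theta1 * x_bar
print(theta0, theta1)  # close to the true intercept 1 and slope 2
```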
In general, cost functions are not convex, and it is not possible to find a minimum analytically (and there are absolutely no guarantees of finding the global minimum).
Gradient descent is a very powerful method for training model parameters by minimizing the loss function. It is one of the simplest optimization methods, and is also referred to as steepest descent.
Let $\theta \in \mathbb{R}^D$ and let $f(\theta)$ be a scalar-valued function. The gradient of $f(\cdot)$ with respect to $\theta$ is:
$$ \nabla_{\theta} f(\theta) = \left[ \begin{array}{c} \frac{\partial f(\theta)}{\partial \theta_1} \\ \frac{\partial f(\theta)}{\partial \theta_2} \\ \frac{\partial f(\theta)}{\partial \theta_3} \\ \vdots \\ \frac{\partial f(\theta)}{\partial \theta_D} \end{array} \right] \in \mathbb{R}^D $$Gradient descent update rule is:
\begin{equation} \begin{split} \theta^{(k+1)} & = \theta^{(k)} - \eta \frac{\partial C}{\partial \theta} \\ & = \theta^{(k)} - \eta \nabla_{\theta} C \end{split} \end{equation}$\eta$ is referred to as the learning rate. It controls the size of the step taken at each iteration.
For our 1D line fitting example, gradient descent update rule is $$\theta_0^{(k+1)} = \theta_0^{(k)} - \eta \frac{\partial C}{\partial \theta_0}$$ and $$\theta_1^{(k+1)} = \theta_1^{(k)} - \eta \frac{\partial C}{\partial \theta_1}$$
Where $$ \frac{\partial C}{\partial \theta_0} = -2 \sum_{i=1}^N \left(y^{(i)} - f(x^{(i)}) \right) \mathrm{ \ and\ } \frac{\partial C}{\partial \theta_1} = -2 \sum_{i=1}^N x^{(i)} \left(y^{(i)} - f(x^{(i)}) \right) $$
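Putting the update rules and gradients together, a minimal gradient descent loop might look like this (NumPy; the learning rate, initialization, and toy data are all illustrative choices):

```python
import numpy as np

# Toy data near the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.9, 5.1, 7.0])

theta0, theta1 = 0.0, 0.0   # initial guess (assumed)
eta = 0.01                  # learning rate (assumed)
eps = 1e-9                  # convergence threshold on the decrease in C

def cost(t0, t1):
    return np.sum((y - (t0 + t1 * x)) ** 2)

prev = cost(theta0, theta1)
for k in range(100000):
    residual = y - (theta0 + theta1 * x)
    grad0 = -2 * np.sum(residual)       # dC/dtheta0
    grad1 = -2 * np.sum(x * residual)   # dC/dtheta1
    theta0 -= eta * grad0
    theta1 -= eta * grad1
    cur = cost(theta0, theta1)
    if prev - cur < eps:                # declare convergence
        break
    prev = cur

print(theta0, theta1)  # approaches the least squares fit
```

The loop also illustrates the stopping rule discussed below: iterate until the per-step decrease in $C$ falls under a threshold.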
For gradient descent, $C(\theta)$, i.e., the objective, should decrease after every step.
A common scheme is to stop iterating (i.e., declare convergence) when the decrease in $C(\theta)$ in one iteration is less than some threshold $\epsilon$ (say $10^{-3}$).
General rule of thumb: If gradient descent isn't working, use a smaller $\eta$.
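The effect of $\eta$ can be seen on the simple convex cost $C(\theta) = \theta^2$, whose gradient is $C'(\theta) = 2\theta$; the step sizes below are chosen purely for illustration:

```python
def descend(eta, steps=20, theta=1.0):
    """Run gradient descent on C(theta) = theta^2 (gradient 2*theta)."""
    for _ in range(steps):
        theta = theta - eta * 2 * theta
    return theta

print(descend(eta=0.1))  # shrinks toward the minimum at 0
print(descend(eta=1.5))  # too large: each step overshoots and |theta| grows
```

With $\eta = 0.1$ each step multiplies $\theta$ by $0.8$, so it converges; with $\eta = 1.5$ each step multiplies $\theta$ by $-2$, so it diverges, which is why shrinking $\eta$ is the first fix to try.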
We have looked at gradient descent; now we look at another method, Newton's method, that can be used to estimate the parameters $\theta$. This method typically converges much faster than gradient descent.
Problem: Given $f(\theta)$, find $\theta$ s.t. $f(\theta) = 0$
The gradient is the slope of the function.
$$ \begin{split} f'(\theta^{(0)}) &= \frac{f(\theta^{(0)})}{\Delta} \\ &= \frac{f(\theta^{(0)})}{\theta^{(0)} - \theta^{(1)}} \end{split} $$An iterative method to solve the above problem is: $$ \theta^{(k+1)} = \theta^{(k)} - \frac{f(\theta^{(k)})}{f'(\theta^{(k)})} $$
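This root-finding iteration can be sketched in code; the function $f(\theta) = \theta^2 - 2$ and the starting point are illustrative (its positive root is $\sqrt{2}$):

```python
def newton_root(f, fprime, theta, iters=10):
    """Newton's method for solving f(theta) = 0."""
    for _ in range(iters):
        theta = theta - f(theta) / fprime(theta)   # theta_{k+1} = theta_k - f/f'
    return theta

# Find the positive root of f(theta) = theta^2 - 2, i.e., sqrt(2).
root = newton_root(lambda t: t**2 - 2, lambda t: 2*t, theta=1.0)
print(root)  # ~1.41421356...
```

Note how few iterations are needed: near a root, each step roughly doubles the number of correct digits.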
We are interested in maximizing or minimizing a function; e.g., we minimize the negative log likelihood to estimate the parameters.
Gradient is $0$ at maxima, minima and saddle points of a function. We can use the following update rule to maximize or minimize a function.
$$ \theta^{(1)} = \theta^{(0)} - \frac{C'(\theta^{(0)})}{C''(\theta^{(0)})} $$The Hessian matrix of $f(\cdot)$ w.r.t. $\theta$, written $\nabla_{\theta}^2 f(\theta)$ or simply $\mathbf{H}$, is a $D \times D$ matrix of partial derivatives.
\begin{equation*} \mathbf{H} = \left[ \begin{array}{cccc} \frac{\partial^2 f(\theta)}{\partial \theta_1^2} & \frac{\partial^2 f(\theta)}{\partial \theta_1 \partial \theta_2} & \cdots & \frac{\partial^2 f(\theta)}{\partial \theta_1 \partial \theta_D} \\ \frac{\partial^2 f(\theta)}{\partial \theta_2 \partial \theta_1} & \frac{\partial^2 f(\theta)}{\partial \theta_2^2} & \cdots & \frac{\partial^2 f(\theta)}{\partial \theta_2 \partial \theta_D} \\ \vdots & \vdots & & \vdots \\ \frac{\partial^2 f(\theta)}{\partial \theta_D \partial \theta_1} & \frac{\partial^2 f(\theta)}{\partial \theta_D \partial \theta_2} & \cdots & \frac{\partial^2 f(\theta)}{\partial \theta_D^2} \\ \end{array} \right] \end{equation*}The most basic second-order optimization algorithm is Newton's algorithm.
$$ \theta^{(k+1)} = \theta^{(k)} - \mathbf{H}_{k}^{-1} \mathbf{g}_k $$where $\mathbf{H}_{k}$ is the Hessian at step $k$ and $\mathbf{g}_k$ is the gradient at step $k$.
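A minimal sketch of this update for a multivariate function (a quadratic bowl with a hand-computed gradient and Hessian; the matrix $A$, vector $b$, and starting point are illustrative):

```python
import numpy as np

# Minimize f(theta) = (1/2) theta^T A theta - b^T theta for a fixed
# positive definite A; gradient g = A theta - b, Hessian H = A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

theta = np.array([5.0, -5.0])   # arbitrary starting point
for _ in range(5):
    g = A @ theta - b                        # gradient at current theta
    H = A                                    # Hessian (constant for a quadratic)
    theta = theta - np.linalg.solve(H, g)    # Newton step: H^{-1} g

print(theta)  # the minimizer solves A theta = b, i.e., [0.2, 0.4]
```

Two remarks: for a quadratic, a single Newton step lands exactly on the minimum; and in practice one solves the linear system $\mathbf{H} \Delta = \mathbf{g}$ (as `np.linalg.solve` does) rather than forming $\mathbf{H}^{-1}$ explicitly.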