Faisal Qureshi
http://www.vclab.ca
How can we extend linear models?
We can represent the above network as:
$$ y = f_o \left( \mathbf{W}_{ho} f_h \left( \mathbf{W}_{ih} \mathbf{x} \right) \right). $$Here, $\mathbf{x} \in \mathbb{R}^d$ is the d-dimensional input, $\mathbf{W}_{ih} \in \mathbb{R}^{d_h \times d}$ is the input-to-hidden-layer weight matrix, $d_h$ is the size of the hidden layer, $\mathbf{W}_{ho} \in \mathbb{R}^{1 \times d_h}$ is the hidden-layer-to-output weight matrix, and $f_h$ and $f_o$ are the activation functions for the hidden layer and the output, respectively. An activation function takes a vector as input and returns a vector of the same size; the function is applied element-wise.
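As a concrete illustration, here is a minimal sketch of this forward pass in PyTorch. The sizes $d = 4$ and $d_h = 8$, and the choice of $\tanh$ and sigmoid activations, are made up for this example.
import torch
# hypothetical sizes: d = 4 input features, d_h = 8 hidden units
d, d_h = 4, 8
W_ih = torch.randn(d_h, d)     # input-to-hidden weight matrix
W_ho = torch.randn(1, d_h)     # hidden-to-output weight matrix
f_h, f_o = torch.tanh, torch.sigmoid
x = torch.randn(d)             # a single d-dimensional input
y = f_o(W_ho @ f_h(W_ih @ x))  # y = f_o(W_ho f_h(W_ih x))
print(y.shape)                 # torch.Size([1])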
Traditionally, popular choices for $f_h$ are the sigmoid and the hyperbolic tangent ($\tanh$), plotted below.
Q. What if we use linear activation functions throughout?
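A quick numerical sketch of the answer (reusing the made-up sizes from above): with identity activations the two weight matrices collapse into the single matrix $\mathbf{W}_{ho}\mathbf{W}_{ih}$, so the network is no more expressive than a linear model.
import torch
d, d_h = 4, 8
W_ih = torch.randn(d_h, d)
W_ho = torch.randn(1, d_h)
x = torch.randn(d)
y_two_layer = W_ho @ (W_ih @ x)   # linear "hidden layer" followed by linear output
y_one_layer = (W_ho @ W_ih) @ x   # equivalent single linear map
print(torch.allclose(y_two_layer, y_one_layer))  # True (up to floating point error)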
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
x = np.linspace(-5,5,100)
plt.figure(figsize=(10,5))
plt.suptitle('Activation functions')
plt.subplot(1,2,1)
plt.title('$tanh$')
plt.plot(x, np.tanh(x))
plt.xlabel('$x$')
plt.grid()
plt.subplot(1,2,2)
plt.title('$sigmoid$')
plt.xlabel('$x$')
plt.grid()
plt.plot(x, sigmoid(x));
import torch
import torch.nn as nn
x = torch.linspace(-10,10,100)
plt.figure(figsize=(20,10))
plt.subplot(2,4,1)
plt.title('ReLU')
plt.grid()
plt.plot(x.numpy(), nn.ReLU()(x).numpy())
plt.subplot(2,4,2)
plt.title('Leaky ReLU')
plt.grid()
plt.plot(x.numpy(), nn.LeakyReLU(0.1)(x).numpy())
plt.subplot(2,4,3)
plt.title('ELU')
plt.grid()
plt.plot(x.numpy(), nn.ELU()(x).numpy())
plt.subplot(2,4,4)
plt.title('Hardswish')
plt.grid()
plt.plot(x.numpy(), nn.Hardswish()(x).numpy())
plt.subplot(2,4,5)
plt.title('GeLU')
plt.grid()
plt.plot(x.numpy(), nn.GELU()(x).numpy())
plt.subplot(2,4,6)
plt.title('Mish')
plt.grid()
plt.plot(x.numpy(), nn.Mish()(x).numpy())
plt.subplot(2,4,7)
plt.title('CELU')
plt.grid()
plt.plot(x.numpy(), nn.CELU()(x).numpy())
plt.subplot(2,4,8)
plt.title('ReLU6')
plt.grid()
plt.plot(x.numpy(), nn.ReLU6()(x).numpy());
Q. How do we compute the gradient of the cost w.r.t. the network parameters?
We can define the negative log likelihood for a two-class softmax classifier as
$$ \small l(\theta) = - \sum_{i=1}^N \left[ \mathbb{I}_0 (y^{(i)}) \log \frac{e^{\mathbf{x}^{(i)^T} \theta_1}}{e^{\mathbf{x}^{(i)^T} \theta_1} + e^{\mathbf{x}^{(i)^T} \theta_2}} + \mathbb{I}_1 (y^{(i)}) \log \frac{e^{\mathbf{x}^{(i)^T} \theta_2}}{e^{\mathbf{x}^{(i)^T} \theta_1} + e^{\mathbf{x}^{(i)^T} \theta_2}} \right]. $$We use this negative log likelihood as the cost $C(\theta)$ for this problem. We need to compute the gradient of this cost w.r.t. the network parameters for gradient-based learning.
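A small sketch of this cost in PyTorch (the toy data and parameter values below are made up; the log-sum-exp form is used for numerical stability):
import torch
# toy data: N = 5 examples with d = 3 features, labels in {0, 1}
X = torch.randn(5, 3)
y = torch.tensor([0, 1, 1, 0, 1])
theta_1 = torch.randn(3)
theta_2 = torch.randn(3)
# per-class scores x^T theta_1 and x^T theta_2, shape (N, 2)
scores = torch.stack([X @ theta_1, X @ theta_2], dim=1)
# log softmax probabilities: log p_k = score_k - logsumexp(scores)
log_probs = scores - torch.logsumexp(scores, dim=1, keepdim=True)
# negative log likelihood: sum of -log p_{y^(i)} over the data
nll = -log_probs[torch.arange(len(y)), y].sum()
print(nll.item())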
We can represent this network as a sequence of layers, as follows.
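One such decomposition (assumed here; it matches the chain-rule expressions below), written for a single example $\mathbf{x}$ with label $y$, is $$ z^2_k = \mathbf{x}^T \theta_k \;\; (k = 1, 2), \qquad z^3_k = \frac{e^{z^2_k}}{e^{z^2_1} + e^{z^2_2}}, \qquad z^4 = -\left[ \mathbb{I}_0(y) \log z^3_1 + \mathbb{I}_1(y) \log z^3_2 \right], $$so layer 1 computes the per-class scores from $\mathbf{x}$, layer 2 applies the softmax, and layer 3 computes the negative log likelihood.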
We can use the chain rule to compute $\frac{\partial z^4}{\partial \theta_1}$ and $\frac{\partial z^4}{\partial \theta_2}$.
$$ \begin{eqnarray} \frac{\partial z^4}{\partial \theta_1} &=& \frac{\partial z^4}{\partial z_1^3} \frac{\partial z_1^3}{\partial \theta_1} + \frac{\partial z^4}{\partial z_2^3} \frac{\partial z_2^3}{\partial \theta_1} \\ &=& \frac{\partial z^4}{\partial z_1^3} \left( \frac{\partial z^3_1}{\partial z_1^2} \frac{\partial z_1^2}{\partial \theta_1} + \frac{\partial z^3_1}{\partial z_2^2} \frac{\partial z_2^2}{\partial \theta_1} \right) \\ &+& \frac{\partial z^4}{\partial z_2^3} \left( \frac{\partial z^3_2}{\partial z_1^2} \frac{\partial z_1^2}{\partial \theta_1} + \frac{\partial z^3_2}{\partial z_2^2} \frac{\partial z_2^2}{\partial \theta_1} \right) \end{eqnarray} $$We can similarly compute $\frac{\partial z^4}{\partial \theta_2}$.
Recall that $z^4 = l(\theta)$; we can therefore minimize $l(\theta)$ via gradient descent using the gradients computed above.
For a given layer $l$ with inputs $z_i^l$ and outputs $z_k^{l+1}$, define $\delta^l_i = \frac{\partial C(\theta)}{\partial z^l_i}$. These deltas satisfy the recursion $$ \delta^l_i = \sum_k \delta^{l+1}_k \frac{\partial z^{l+1}_k}{\partial z^l_i} $$
Similarly, for layer $l$ that depends upon parameters $\theta^l$, $$ \frac{\partial C(\theta)}{\partial \theta^l} = \sum_k \frac{\partial C(\theta)}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial \theta^l} = \sum_k \delta^{l+1}_k \frac{\partial z^{l+1}_k}{\partial \theta^l} $$
In our two-class softmax classifier, only layer 1 has parameters ($\theta_1$ and $\theta_2$).
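As a sketch, the gradients w.r.t. $\theta_1$ and $\theta_2$ can be computed by hand from these formulas and checked against PyTorch autodiff. The example below is made up (a single data point), and the deltas of the log and softmax layers are folded into the familiar softmax-minus-one-hot expression at the scores.
import torch
# a single toy example with d = 3 features and label y = 0
x = torch.randn(3)
y = 0
theta = torch.randn(2, 3, requires_grad=True)   # rows play the role of theta_1, theta_2
# forward pass through the layers
z2 = theta @ x                     # layer 1: per-class scores
z3 = torch.softmax(z2, dim=0)      # layer 2: softmax
z4 = -torch.log(z3[y])             # layer 3: negative log likelihood
# manual backward pass: delta at the scores is softmax(z2) - one_hot(y)
one_hot = torch.zeros(2)
one_hot[y] = 1.0
delta2 = (z3 - one_hot).detach()
grad_manual = delta2[:, None] * x[None, :]      # dC/dtheta_k = delta2_k * x
# autodiff backward pass
z4.backward()
print(torch.allclose(grad_manual, theta.grad))  # True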
import torch
import numpy as np
def sigmoid(x):
    return 1. / (1. + torch.exp(-x))
def derivative_of_sigmoid(x):
    "Derivative of a sigmoid (analytical)"
    return sigmoid(x) * (1 - sigmoid(x))
# input
x = torch.linspace(-10,10,100, requires_grad=True)
# derivative of a sigmoid
dx = derivative_of_sigmoid(x)
# PyTorch program that implements sigmoid
z = sigmoid(x)
# using PyTorch autodiff to compute the derivative of the sigmoid
z_ = torch.sum(z) # because backward can only be called on scalars
z_.backward() # the backward pass
plt.figure(figsize=(8,8))
plt.title('Using PyTorch to compute the derivative of a sigmoid')
plt.plot(x.detach().numpy(), z.detach().numpy(), 'k', label='sigmoid')
plt.grid()
plt.plot(x.detach().numpy(), dx.detach().numpy(), 'b.', label='derivative computed analytically')
plt.plot(x.detach().numpy(), x.grad.detach().numpy(), 'r', label='derivative using autodiff')
plt.xlabel('x')
plt.legend();