# Neural Networks¶

Faisal Qureshi
http://www.vclab.ca

## Lesson Plan¶

• Feed-forward neural networks
• Activation functions
• Gradient based learning in neural networks
• Backpropagation
• Layered view of neural networks

## Feed forward neural networks¶

• Approximate some function $y=f^{*}(\mathbf{x})$ by learning parameters $\theta$ s.t. $\tilde{y} = f(\mathbf{x}; \theta)$
• Feed forward neural networks can be seen as directed acyclic graphs $$y = f(\mathbf{x}) = f^{(3)}(f^{(2)}(f^{(1)}(\mathbf{x})))$$
• Training examples specify the output of the last layer
• Network needs to figure out the inputs/outputs for the hidden layers

## Extending linear models¶

How can we extend linear models?

• Specify a very general $\phi$ s.t. the model becomes $y=\theta^T \phi(\mathbf{x})$
• Problem with generalization
• Difficult to encode prior information needed to solve AI-level tasks
• Engineer $\phi$ for the task at hand
• Tedious
• Difficult to transfer to new tasks
• Neural networks approaches
• $y = f(\mathbf{x} ; \theta, w) = \phi(\mathbf{x}; \theta)^T w$ i.e. use parameters $\theta$ to learn $\phi$ and use $w$ to map $\phi(\mathbf{x})$ to the desired output $y$
• The training problem is non-convex
• Key advantage: a designer just need to specify the right family of functions and not the exact function $\phi$

## Classical artificial neural networks¶

• Shallow and wide
• One hidden layer can represent any function
• Focus was on efficient ways to optimize (train)

### Mathematically¶

We can represent the above network as:

$$y = f_o \left( \mathbf{W}_{oh} f_h \left( \mathbf{W}_{ih} \mathbf{x} \right) \right).$$

Here, $\mathbf{x} \in \mathbb{R}^d$ is the d-dimensional input, $\mathbf{W}_{ih} \in \mathbb{R}^{d_h \times d}$ is the input-to-hidden-layer weight matrix, $d_h$ is the size of the hidden layer, $\mathbf{W}_{ho} \in \mathbb{R}^{1 \times d_h}$ is the hidden-layer-to-output weight matrix, and $f_h$ and $f_o$ are activations functions for hidden layer and output, respectively. Activation functions take a vector as input and return a vector of the same size. The function is applied element-wise.

Traditionally, possible choices for $f_h$ are:

• hyperbolic tangent; and
• sigmoid.

Q. What if we use linear activation functions throughout?

### Activation functions¶

In [1]:
import numpy as np
import matplotlib.pyplot as plt

In [2]:
def sigmoid(x):
return 1 / (1 + np.exp(-x))

x = np.linspace(-5,5,100)

plt.figure(figsize=(10,5))
plt.suptitle('Activation functions')
plt.subplot(1,2,1)
plt.title('$tanh$')
plt.plot(x, np.tanh(x))
plt.xlabel('$x$')
plt.grid()
plt.subplot(1,2,2)
plt.title('$sigmoid$')
plt.xlabel('$x$')
plt.grid()
plt.plot(x, sigmoid(x));


### Example: a regression network¶

• Input: 1D, real numbers
• Ouput: 1D, real numbers
• Hidden layer size: 2
• Number of weights: 7
• Loss: MSE
• Statistics connection: line fitting

### Example: a classification network¶

• Input: 2D, real numbers
• Output: 1D, class labels 0 or 1
• Hidden layer size: 2
• Number of weights: 9
• Loss: Cross-entropy
• Statistics connect: probabilitist view of line fitting

## Deep neural networks¶

• Multi-layer networks
• These networks are deeper than these are wider
• Hierarchical representation
• Reduces semantic gap
• Deep networks outperform humans on many tasks
• Advances in computer science, physics and engineering

### Activation functions¶

• The activation function for output:
• Identity function for regression;
• Sigmoid for binary classification; and
• Softmax for multi-class classification.
• The following activation functions are typically used for hidden layers.
In [3]:
import torch
import torch.nn as nn
x = torch.linspace(-10,10,100)

plt.figure(figsize=(20,10))
plt.subplot(2,4,1)
plt.title('ReLU')
plt.grid()
plt.plot(x.numpy(), nn.ReLU()(x).numpy())
plt.subplot(2,4,2)
plt.title('Leaky ReLU')
plt.grid()
plt.plot(x.numpy(), nn.LeakyReLU(0.1)(x).numpy())
plt.subplot(2,4,3)
plt.title('ELU')
plt.grid()
plt.plot(x.numpy(), nn.ELU()(x).numpy())
plt.subplot(2,4,4)
plt.title('Hardswish')
plt.grid()
plt.plot(x.numpy(), nn.Hardswish()(x).numpy())
plt.subplot(2,4,5)
plt.title('GeLU')
plt.grid()
plt.plot(x.numpy(), nn.GELU()(x).numpy())
plt.subplot(2,4,6)
plt.title('Mish')
plt.grid()
plt.plot(x.numpy(), nn.Mish()(x).numpy())
plt.subplot(2,4,7)
plt.title('CELU')
plt.grid()
plt.plot(x.numpy(), nn.CELU()(x).numpy())
plt.subplot(2,4,8)
plt.title('ReLU6')
plt.grid()
plt.plot(x.numpy(), nn.ReLU6()(x).numpy());


## Gradient-based learning in neural networks¶

• Non-linearities of neural networks render most cost functions non-convex
• Use iterative gradient based optimizers to drive cost function to lower values
• Gradient descent applied to non-convex cost functions provides no guarantees, and it is sensitive to initial conditions
• Initialize weights to small random values
• Initialize biases to zero or small positive values

Q. How to compute the gradient of cost w.r.t. to network parameters.

• Backpropagation (Rumelhard et al. 1986) is often used to compute the gradient of the error function w.r.t. to network's weights.
• We will examine backpropagation within a layered view of neural networks
• Treating neural networks as differentiable layers allows a plug-and-play compositional ability enabling us to construct large neural networks

#### Example: Two-class softmax classifier¶

We can define the negative log likelihood for a two-class softmax classifier as

$$\small \begin{split} l(\theta) = - \sum_{i=1}^N \mathbb{I}_0 (y^{(i)}) \log \frac{e^{\mathbf{x}^{(i)^T} \theta_1}}{e^{\mathbf{x}^{(i)^T} \theta_1} + {e^{\mathbf{x}^{(i)^T} \theta_2}}} + \mathbb{I}_1 (y^{(i)}) \log \frac{e^{\mathbf{x}^{(i)^T} \theta_2}}{e^{\mathbf{x}^{(i)^T} \theta_1} + {e^{\mathbf{x}^{(i)^T} \theta_2}}} \end{split}.$$

We use the negative log likelihood as the cost $C(\theta)$ for this problem. We need to compute the gradient of this loss w.r.t. network paramters for gradient-based learning.

We can represent this network as layers as follows.

##### Chain rule¶
$$\frac{\partial f(g(u,v),h(u,v))}{\partial u} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial u} + \frac{\partial f}{\partial h} \frac{\partial h}{\partial u}$$

We can use the chain rule to compute $\frac{\partial z^4}{\partial \theta_1}$ and $\frac{\partial z^4}{\partial \theta_2}$.

$$\begin{eqnarray} \frac{\partial z^4}{\partial \theta_1} &=& \frac{\partial z^4}{\partial z_1^3} \frac{\partial z_1^3}{\partial \theta_1} + \frac{\partial z^4}{\partial z_2^3} \frac{\partial z_2^3}{\partial \theta_1} \\ &=& \frac{\partial z^4}{\partial z_1^3} \left( \frac{\partial z^3_1}{\partial z_1^2} \frac{\partial z_1^2}{\partial \theta_1} + \frac{\partial z^3_1}{\partial z_2^2} \frac{\partial z_2^2}{\partial \theta_1} \right) \\ &+& \frac{\partial z^4}{\partial z_2^3} \left( \frac{\partial z^3_2}{\partial z_1^2} \frac{\partial z_1^2}{\partial \theta_1} + \frac{\partial z^3_2}{\partial z_2^2} \frac{\partial z_2^2}{\partial \theta_1} \right) \end{eqnarray}$$

We can similarly compute $\frac{\partial z^4}{\partial \theta_2}$.

Recall that $z^4 = l(\theta)$, and we can minimize the $l(\theta)$ using gradient descent using the gradients computed above.

##### Forward pass¶
$$\begin{eqnarray} z^1 &=& f(x) \mathrm{\ \ (input data)}\\ z^2 &=& f(z^1) \mathrm{\ \ (linear function)}\\ z^3 &=& f(z^2) \mathrm{\ \ (log softmax)} \\ z^4 &=& f(z^3) = l(\theta) \mathrm{\ \ (negative log likelihood, cost)} \end{eqnarray}$$
##### Backward pass¶
$$\delta^l = \frac{\partial l(\theta)}{\partial z^L}$$
##### Computing $\delta^l$¶
$$\begin{split} \delta^4 &= \frac{\partial C({\theta})}{\partial z^4} = \frac{\partial z^4}{\partial z^4} = 1 \\ \delta^3_1 &= \frac{\partial C(\theta)}{\partial z^3_1} = \frac{\partial C(\theta)}{\partial z^4} \frac{\partial z^4}{\partial z^3_1} = \delta^4 \frac{\partial z^4}{\partial z^3_1} \\ \delta^3_2 &= \frac{\partial C(\theta)}{\partial z^3_2} = \frac{\partial C(\theta)}{\partial z^4} \frac{\partial z^4}{\partial z^3_2} = \delta^4 \frac{\partial z^4}{\partial z^3_2} \\ \delta^2_1 &= \frac{\partial C(\theta)}{\partial z^2_1} = \sum_k \frac{\partial C(\theta)}{\partial z^3_k} \frac{\partial z^3_k}{\partial z^2_1} = \sum_k \delta^3_k \frac{\partial z^3_k}{\partial z^2_1} \\ \delta^2_2 &= \frac{\partial C(\theta)}{\partial z^2_2} = \sum_k \frac{\partial C(\theta)}{\partial z^3_k} \frac{\partial z^3_k}{\partial z^2_2} = \sum_k \delta^3_k \frac{\partial z^3_k}{\partial z^2_2} \end{split}$$

#### For any differentiable layer $l$¶

For a given layer $l$, with inputs $z_i^l$ and outputs $z_k^{l+1}$ $$\delta^l_i = \sum_k \delta^{l+1}_k \frac{\partial z^{l+1}_k}{\partial z^l_i}$$

Similarly, for layer $l$ that depends upon parameters $\theta^l$, $$\frac{\partial C(\theta)}{\partial \theta^l} = \sum_k \frac{\partial C(\theta)}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial \theta^l} = \sum_k \delta^{l+1}_k \frac{\partial z^{l+1}_k}{\partial \theta^l}$$

In our 2-class softmax classifier only layer 1 has parameters ($\theta_0$ and $\theta_1$).

### Backpropagation¶

• Set $z^1$ equal to input $\mathbf{x}$.
• Forward pass: compute $z^2, z^3, ...$ layers $1, 2, ...$ activations.
• Set $\delta$ at the last layer equal to 1
• Backward pass: backpropagate $\delta$s all the way to first layer.
• Update $\theta$
• Repeat

## Layered architectures¶

• As long as we have differentiable layers, i.e., we can compute $\frac{\partial z^{l+1}_k}{\partial z^{l}_i}$, we can use backpropagation to update the parameters $\theta$ to minimize the cost $C(\theta)$.
• A neural network can itself be treated as a layer within another neural netowrk (recursion)
• This allows us to build new neural networks using exisitng (and sometimes pre-trained) models

## This is all good in theory, but what about in practice¶

### GPUs¶

• Support fast vectorized processing

### Autodiff¶

• Techniques to evaluate the derivative of a computer program
In [4]:
import torch
import numpy as np

def sigmoid(x):
return 1. / (1. + torch.exp(-x))

def derivative_of_sigmoid(x):
"Derivative of a sigmoid (analytical)"
return sigmoid(x) * (1 - sigmoid(x))

# input

# derivative of a sigmoid
dx = derivative_of_sigmoid(x)

# PyTorch program that implements sigmoid
z = sigmoid(x)

# using PyTorch autodiff to compute the derivative of the sigmoid
z_ = torch.sum(z)   # because backward can only be called on scalers
z_.backward()       # the backward pass

plt.figure(figsize=(8,8))
plt.title('Using PyTorch to compute the derivative of a sigmoid')
plt.plot(x.detach().numpy(), z.detach().numpy(), 'k', label='sigmoid')
plt.grid()
plt.plot(x.detach().numpy(), dx.detach().numpy(), 'b.', label='derivative computed analytically')
plt.plot(x.detach().numpy(), x.grad.detach().numpy(), 'r', label='derivative using autodiff')
plt.xlabel('x')
plt.legend();


### Implication¶

• It is easy to construct your own layers
• Provide ways to compute the derivative of the outputs w.r.t. the inputs
• Provide ways to compute the derivative of the outputs w.r.t. layer parameters
• Many layers do not have any parameters
• See above: using autodiff, it is possible to simply "program" a layer

## But what about model complexity?¶

• Do we need to perform regularization in neural networks?
• Yes
• Deep learning, especially, is applied to extremely complex tasks, consequently regularization is not as simple as controlling the number of parameters.
• Common techniques
• Parameter norm penalities
• Data augmentation
• Fake data
• Successful in object detection, classification, and segmentation
• Noise injection
• Applying (random) noise to inputs
• Applying (random) noise to hidden layers' inputs
• Noise added to network weights
• Recurrent neural networks
• A practical stochastic implementation of Bayesian inference over weights
• Archtecture engineering
• Dropout

## Summary¶

• Different ways to interpret a neural network
• Compositions of non-linear functions
• Computational graphs
• Comprised of differentiable layers
• Where possible compose new networks using existing networks