Neural Networks

Faisal Qureshi

Lesson Plan

  • Feed-forward neural networks
  • Activation functions
  • Gradient based learning in neural networks
    • Backpropagation
  • Layered view of neural networks

Feed-forward neural networks

  • Approximate some function $y=f^{*}(\mathbf{x})$ by learning parameters $\theta$ s.t. $\tilde{y} = f(\mathbf{x}; \theta)$
  • Feed-forward neural networks can be seen as directed acyclic graphs (a code sketch follows this list) $$ y = f(\mathbf{x}) = f^{(3)}(f^{(2)}(f^{(1)}(\mathbf{x}))) $$
  • Training examples specify the output of the last layer
    • Network needs to figure out the inputs/outputs for the hidden layers
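For concreteness, here is a minimal PyTorch sketch of such a composition; the layer sizes are arbitrary assumptions for illustration.

import torch
import torch.nn as nn

# three differentiable functions, chained as f^(3)(f^(2)(f^(1)(x)))
f1 = nn.Linear(4, 8)                             # f^(1)
f2 = nn.Sequential(nn.Tanh(), nn.Linear(8, 8))   # f^(2)
f3 = nn.Sequential(nn.Tanh(), nn.Linear(8, 1))   # f^(3)

x = torch.randn(4)
y = f3(f2(f1(x)))   # the network is the composition of the three functions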

Extending linear models

How can we extend linear models?

  • Specify a very general $\phi$ s.t. the model becomes $y=\theta^T \phi(\mathbf{x})$
    • Problem with generalization
    • Difficult to encode prior information needed to solve AI-level tasks
  • Engineer $\phi$ for the task at hand
    • Tedious
    • Difficult to transfer to new tasks
  • Neural network approach
    • $y = f(\mathbf{x} ; \theta, w) = \phi(\mathbf{x}; \theta)^T w$ i.e. use parameters $\theta$ to learn $\phi$ and use $w$ to map $\phi(\mathbf{x})$ to the desired output $y$
    • The training problem is non-convex
    • Key advantage: a designer just needs to specify the right family of functions, not the exact function $\phi$

Classical artificial neural networks

  • Shallow and wide
  • One hidden layer can represent any function
  • Focus was on efficient ways to optimize (train)


We can represent a classical network with a single hidden layer as:

$$ y = f_o \left( \mathbf{W}_{ho} f_h \left( \mathbf{W}_{ih} \mathbf{x} \right) \right). $$

Here, $\mathbf{x} \in \mathbb{R}^d$ is the d-dimensional input, $\mathbf{W}_{ih} \in \mathbb{R}^{d_h \times d}$ is the input-to-hidden-layer weight matrix, $d_h$ is the size of the hidden layer, $\mathbf{W}_{ho} \in \mathbb{R}^{1 \times d_h}$ is the hidden-layer-to-output weight matrix, and $f_h$ and $f_o$ are the activation functions for the hidden layer and the output, respectively. An activation function takes a vector as input and returns a vector of the same size; it is applied element-wise.
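As a sanity check, here is a minimal NumPy sketch of this forward pass; the dimensions are arbitrary assumptions.

import numpy as np

d, d_h = 3, 4                     # input and hidden-layer sizes (assumed)
W_ih = np.random.randn(d_h, d)    # input-to-hidden weight matrix
W_ho = np.random.randn(1, d_h)    # hidden-to-output weight matrix
f_h = np.tanh                     # hidden-layer activation
f_o = lambda a: a                 # identity output activation (e.g., regression)

x = np.random.randn(d)
y = f_o(W_ho @ f_h(W_ih @ x))     # y = f_o(W_ho f_h(W_ih x))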

Traditionally, possible choices for $f_h$ are:

  • hyperbolic tangent; and
  • sigmoid.

Q. What if we use linear activation functions throughout?

Activation functions

In [1]:
import numpy as np
import matplotlib.pyplot as plt
In [2]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-5, 5, 100)

# plot the two classical hidden-layer activations
plt.suptitle('Activation functions')
plt.plot(x, np.tanh(x), label='tanh')
plt.plot(x, sigmoid(x), label='sigmoid')
plt.legend();

Example: a regression network

  • Input: 1D, real numbers
  • Output: 1D, real numbers
  • Hidden layer size: 2
  • Number of weights: 7
  • Loss: MSE
  • Statistics connection: line fitting
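A possible PyTorch realization of this regression network (a sketch; the exact architecture and activation are assumptions):

import torch.nn as nn

# 1 -> 2 -> 1 network: (1*2 + 2) + (2*1 + 1) = 7 weights, including biases
reg_net = nn.Sequential(nn.Linear(1, 2), nn.Tanh(), nn.Linear(2, 1))
mse = nn.MSELoss()
print(sum(p.numel() for p in reg_net.parameters()))   # 7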

Example: a classification network

  • Input: 2D, real numbers
  • Output: 1D, class labels 0 or 1
  • Hidden layer size: 2
  • Number of weights: 9
  • Loss: Cross-entropy
  • Statistics connection: probabilistic view of line fitting
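A corresponding sketch for the classification network (again, the hidden activation is an assumption):

import torch.nn as nn

# 2 -> 2 -> 1 network: (2*2 + 2) + (2*1 + 1) = 9 weights, including biases
cls_net = nn.Sequential(nn.Linear(2, 2), nn.Tanh(), nn.Linear(2, 1), nn.Sigmoid())
bce = nn.BCELoss()   # binary cross-entropy
print(sum(p.numel() for p in cls_net.parameters()))   # 9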

Deep neural networks

  • Multi-layer networks
    • These networks are deeper than they are wide
  • Hierarchical representation
    • Reduces semantic gap
  • Deep networks outperform humans on many tasks
  • Access to data
  • Advances in computer science, physics and engineering

Activation functions

  • The activation function for the output layer depends on the task:
    • Identity function for regression;
    • Sigmoid for binary classification; and
    • Softmax for multi-class classification.
  • The following activation functions are typically used for hidden layers.
In [3]:
import torch
import torch.nn as nn
x = torch.linspace(-10,10,100)

plt.title('Activation functions for hidden layers')
plt.plot(x.numpy(), nn.ReLU()(x).numpy(), label='ReLU')
plt.plot(x.numpy(), nn.LeakyReLU(0.1)(x).numpy(), label='LeakyReLU(0.1)')
plt.plot(x.numpy(), nn.ELU()(x).numpy(), label='ELU')
plt.plot(x.numpy(), nn.Hardswish()(x).numpy(), label='Hardswish')
plt.plot(x.numpy(), nn.GELU()(x).numpy(), label='GELU')
plt.plot(x.numpy(), nn.Mish()(x).numpy(), label='Mish')
plt.plot(x.numpy(), nn.CELU()(x).numpy(), label='CELU')
plt.plot(x.numpy(), nn.ReLU6()(x).numpy(), label='ReLU6')
plt.legend();

Gradient-based learning in neural networks

  • Non-linearities of neural networks render most cost functions non-convex
  • Use iterative gradient based optimizers to drive cost function to lower values
  • Gradient descent applied to non-convex cost functions provides no guarantees, and it is sensitive to initial conditions
    • Initialize weights to small random values
    • Initialize biases to zero or small positive values

Q. How do we compute the gradient of the cost w.r.t. the network parameters?

Computing gradients

  • Backpropagation (Rumelhart et al., 1986) is often used to compute the gradient of the error function w.r.t. the network's weights.
  • We will examine backpropagation within a layered view of neural networks
    • Treating neural networks as compositions of differentiable layers provides plug-and-play composability, enabling us to construct large neural networks from simpler pieces

Example: Two-class softmax classifier

We can define the negative log likelihood for a two-class softmax classifier as

$$ \small l(\theta) = - \sum_{i=1}^N \left[ \mathbb{I}_0 (y^{(i)}) \log \frac{e^{\mathbf{x}^{(i)^T} \theta_1}}{e^{\mathbf{x}^{(i)^T} \theta_1} + e^{\mathbf{x}^{(i)^T} \theta_2}} + \mathbb{I}_1 (y^{(i)}) \log \frac{e^{\mathbf{x}^{(i)^T} \theta_2}}{e^{\mathbf{x}^{(i)^T} \theta_1} + e^{\mathbf{x}^{(i)^T} \theta_2}} \right]. $$

We use the negative log likelihood as the cost $C(\theta)$ for this problem. We need to compute the gradient of this loss w.r.t. the network parameters for gradient-based learning.

We can represent this network as a stack of layers: the input, a linear layer, a log-softmax layer, and the negative log likelihood cost.
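A minimal PyTorch sketch of this layered view; the input size, batch size, and data are assumptions for illustration.

import torch
import torch.nn as nn

d, N = 2, 8                            # input size and batch size (assumed)
x = torch.randn(N, d)                  # z^1: input data
y = torch.randint(0, 2, (N,))          # class labels in {0, 1}

linear = nn.Linear(d, 2, bias=False)   # z^2: x^T theta_1 and x^T theta_2
log_softmax = nn.LogSoftmax(dim=1)     # z^3: log softmax
nll = nn.NLLLoss()                     # z^4: negative log likelihood (cost)

z2 = linear(x)
z3 = log_softmax(z2)
z4 = nll(z3, y)                        # = l(theta)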

Chain rule
$$ \frac{\partial f(g(u,v),h(u,v))}{\partial u} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial u} + \frac{\partial f}{\partial h} \frac{\partial h}{\partial u} $$

We can use the chain rule to compute $\frac{\partial z^4}{\partial \theta_1}$ and $\frac{\partial z^4}{\partial \theta_2}$.

$$ \begin{eqnarray} \frac{\partial z^4}{\partial \theta_1} &=& \frac{\partial z^4}{\partial z_1^3} \frac{\partial z_1^3}{\partial \theta_1} + \frac{\partial z^4}{\partial z_2^3} \frac{\partial z_2^3}{\partial \theta_1} \\ &=& \frac{\partial z^4}{\partial z_1^3} \left( \frac{\partial z^3_1}{\partial z_1^2} \frac{\partial z_1^2}{\partial \theta_1} + \frac{\partial z^3_1}{\partial z_2^2} \frac{\partial z_2^2}{\partial \theta_1} \right) \\ &+& \frac{\partial z^4}{\partial z_2^3} \left( \frac{\partial z^3_2}{\partial z_1^2} \frac{\partial z_1^2}{\partial \theta_1} + \frac{\partial z^3_2}{\partial z_2^2} \frac{\partial z_2^2}{\partial \theta_1} \right) \end{eqnarray} $$

We can similarly compute $\frac{\partial z^4}{\partial \theta_2}$.

Recall that $z^4 = l(\theta)$; we can minimize $l(\theta)$ using gradient descent with the gradients computed above.

Forward pass
$$ \begin{eqnarray} z^1 &=& \mathbf{x} \ \ \text{(input data)}\\ z^2 &=& f^{(1)}(z^1) \ \ \text{(linear function)}\\ z^3 &=& f^{(2)}(z^2) \ \ \text{(log softmax)} \\ z^4 &=& f^{(3)}(z^3) = l(\theta) \ \ \text{(negative log likelihood, cost)} \end{eqnarray} $$
Backward pass
$$ \delta^l = \frac{\partial l(\theta)}{\partial z^l} $$
Computing $\delta^l$
$$ \begin{split} \delta^4 &= \frac{\partial C({\theta})}{\partial z^4} = \frac{\partial z^4}{\partial z^4} = 1 \\ \delta^3_1 &= \frac{\partial C(\theta)}{\partial z^3_1} = \frac{\partial C(\theta)}{\partial z^4} \frac{\partial z^4}{\partial z^3_1} = \delta^4 \frac{\partial z^4}{\partial z^3_1} \\ \delta^3_2 &= \frac{\partial C(\theta)}{\partial z^3_2} = \frac{\partial C(\theta)}{\partial z^4} \frac{\partial z^4}{\partial z^3_2} = \delta^4 \frac{\partial z^4}{\partial z^3_2} \\ \delta^2_1 &= \frac{\partial C(\theta)}{\partial z^2_1} = \sum_k \frac{\partial C(\theta)}{\partial z^3_k} \frac{\partial z^3_k}{\partial z^2_1} = \sum_k \delta^3_k \frac{\partial z^3_k}{\partial z^2_1} \\ \delta^2_2 &= \frac{\partial C(\theta)}{\partial z^2_2} = \sum_k \frac{\partial C(\theta)}{\partial z^3_k} \frac{\partial z^3_k}{\partial z^2_2} = \sum_k \delta^3_k \frac{\partial z^3_k}{\partial z^2_2} \end{split} $$

For any differentiable layer $l$

For a given layer $l$, with inputs $z_i^l$ and outputs $z_k^{l+1}$ $$ \delta^l_i = \sum_k \delta^{l+1}_k \frac{\partial z^{l+1}_k}{\partial z^l_i} $$

Similarly, for layer $l$ that depends upon parameters $\theta^l$, $$ \frac{\partial C(\theta)}{\partial \theta^l} = \sum_k \frac{\partial C(\theta)}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial \theta^l} = \sum_k \delta^{l+1}_k \frac{\partial z^{l+1}_k}{\partial \theta^l} $$

In our 2-class softmax classifier, only the linear layer (layer 1, mapping $z^1$ to $z^2$) has parameters ($\theta_1$ and $\theta_2$).


  • Set $z^1$ equal to the input $\mathbf{x}$.
  • Forward pass: compute $z^2, z^3, \ldots$, the activations of layers $1, 2, \ldots$.
  • Set $\delta$ at the last layer equal to 1.
  • Backward pass: backpropagate the $\delta$s all the way to the first layer.
  • Update $\theta$ using the resulting gradients.
  • Repeat.
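With autodiff, this recipe becomes a short training loop. A hedged sketch for the 2-class softmax classifier above; the data, learning rate, and iteration count are assumptions.

import torch
import torch.nn as nn

x = torch.randn(8, 2)                 # toy inputs (assumed)
y = torch.randint(0, 2, (8,))         # toy labels (assumed)
model = nn.Sequential(nn.Linear(2, 2, bias=False), nn.LogSoftmax(dim=1))
nll = nn.NLLLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(100):
    cost = nll(model(x), y)   # forward pass: compute z^2, z^3, ..., the cost
    opt.zero_grad()
    cost.backward()           # backward pass: backpropagate the deltas
    opt.step()                # update theta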

Layered architectures

  • As long as we have differentiable layers, i.e., we can compute $\frac{\partial z^{l+1}_k}{\partial z^{l}_i}$, we can use backpropagation to update the parameters $\theta$ to minimize the cost $C(\theta)$.
  • A neural network can itself be treated as a layer within another neural network (recursion); see the sketch below
    • This allows us to build new neural networks using existing (and sometimes pre-trained) models
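As a small illustration of this recursion, a hedged sketch that wraps one network (standing in for an existing or pre-trained model) as a single differentiable layer inside a new one:

import torch.nn as nn

backbone = nn.Sequential(nn.Linear(2, 8), nn.ReLU())   # stands in for an existing / pre-trained model
new_net = nn.Sequential(backbone, nn.Linear(8, 1))     # the whole backbone acts as one layer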

This is all good in theory, but what about in practice?

  • Modern deep learning frameworks support fast, vectorized processing (often on GPUs)
  • Automatic differentiation (autodiff): techniques to evaluate the derivative of a computer program
In [4]:
import torch
import numpy as np

def sigmoid(x):
    return 1. / (1. + torch.exp(-x))

def derivative_of_sigmoid(x):
    "Derivative of a sigmoid (analytical)"
    return sigmoid(x) * (1 - sigmoid(x)) 

# input
x = torch.linspace(-10,10,100, requires_grad=True)

# derivative of a sigmoid
dx = derivative_of_sigmoid(x)

# PyTorch program that implements sigmoid
z = sigmoid(x)

# using PyTorch autodiff to compute the derivative of the sigmoid
z_ = torch.sum(z)   # because backward can only be called on scalars
z_.backward()       # the backward pass populates x.grad

plt.title('Using PyTorch to compute the derivative of a sigmoid')
plt.plot(x.detach().numpy(), z.detach().numpy(), 'k', label='sigmoid')
plt.plot(x.detach().numpy(), dx.detach().numpy(), 'b.', label='derivative computed analytically')
plt.plot(x.detach().numpy(), x.grad.detach().numpy(), 'r', label='derivative using autodiff')
plt.legend();


  • It is easy to construct your own layers
    • Provide ways to compute the derivative of the outputs w.r.t. the inputs
    • Provide ways to compute the derivative of the outputs w.r.t. layer parameters
      • Many layers do not have any parameters
  • See above: using autodiff, it is possible to simply "program" a layer
    • Derivatives come for free
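For example, a hedged sketch of a custom differentiable layer written with torch.autograd.Function, where we supply the derivative of the output w.r.t. the input by hand (the layer itself, squaring, is an arbitrary choice):

import torch

class Square(torch.autograd.Function):
    # a custom layer z^{l+1} = (z^l)^2 with a hand-written backward pass
    @staticmethod
    def forward(ctx, z):
        ctx.save_for_backward(z)
        return z ** 2

    @staticmethod
    def backward(ctx, delta_out):
        (z,) = ctx.saved_tensors
        return delta_out * 2 * z   # delta_in = delta_out * d z^{l+1} / d z^l

z = torch.randn(5, requires_grad=True)
Square.apply(z).sum().backward()   # gradients arrive in z.grad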

But what about model complexity?

  • Do we need to perform regularization in neural networks?
    • Yes
  • Deep learning, especially, is applied to extremely complex tasks; consequently, regularization is not as simple as controlling the number of parameters.
  • Common techniques
    • Parameter norm penalties
    • Data augmentation
      • Fake data
      • Successful in object detection, classification, and segmentation
    • Noise injection
      • Applying (random) noise to inputs
      • Applying (random) noise to hidden layers' inputs
      • Noise added to network weights
        • Recurrent neural networks
        • A practical stochastic implementation of Bayesian inference over weights
    • Architecture engineering
      • Dropout
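A hedged PyTorch sketch showing two of these techniques, a parameter norm penalty (L2 weight decay) and dropout; the layer sizes and hyperparameters are assumptions.

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(),
                    nn.Dropout(p=0.5),   # architecture engineering: dropout
                    nn.Linear(32, 1))
# parameter norm penalty (L2) via weight decay in the optimizer
opt = torch.optim.SGD(net.parameters(), lr=0.01, weight_decay=1e-4)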


  • Different ways to interpret a neural network
    • Compositions of non-linear functions
    • Computational graphs
    • Comprised of differentiable layers
      • Where possible compose new networks using existing networks
  • Backpropagation: strategy for computing gradients for gradient-based learning
    • Use autodiff to automatically compute gradients for each layer
      • A vast number of "deep learning" frameworks exist (e.g., TensorFlow, Theano, PyTorch); start with those first
  • Controlling model complexity
  • Deep learning
    • Loosely speaking, neural networks with several hidden layers
      • Convolutional layers: used for image processing
      • Fully connected layers: often used at the end for regression or classification