Convolutional Networks

Faisal Qureshi
http://www.vclab.ca

Lesson Plan

  • Convolutional Networks
  • Convolution
  • Pooling layers
  • Dilated convolutions
  • Common architectures
    • GoogLeNet
    • ResNet
    • Densenet
    • Squeeze-and-Excitation network
  • ConvNeXt

Convolutional networks for computer vision tasks

Task: We want to classify the following image

In [1]:
import cv2 
import matplotlib.pyplot as plt

img = cv2.imread('./convnet/1.jpeg', 0)  # flag 0 loads the image as grayscale
plt.figure(figsize=(10,8))
plt.imshow(img, cmap='gray')
plt.xlabel('width')
plt.xticks([])
plt.ylabel('height')
plt.yticks([])
plt.title(f'Image dimensions: {img.shape[0]} x {img.shape[1]}.  #pixels: {img.shape[0] * img.shape[1]}');

Classical neural networks for computer vision tasks

  • Q. How many parameters per hidden layer unit? (see the sketch below)
  • Computational issues
  • Poor performance
    • The model is prone to overfitting
    • Model capacity issues
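To make the question concrete, here is a quick back-of-the-envelope sketch that reuses `img` from the cell above; the 1024-unit layer width is an illustrative choice, not from the slides.

In [ ]:
# Parameters for a single fully connected hidden unit applied to the image:
# one weight per input pixel, plus a bias.
H, W = img.shape
params_per_unit = H * W + 1
print(f'Parameters per hidden unit: {params_per_unit}')
print(f'Parameters for a 1024-unit hidden layer: {1024 * params_per_unit:,}')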

Convolutional neural network

  • David Hubel and Torsten Wiesel studied the cat visual cortex and showed that visual information goes through a series of processing steps: 1) edge detection; 2) edge combination; 3) motion perception; etc. (Hubel and Wiesel, 1959)
    • Neurons are spatially localized
    • Topographic feature maps
    • Hierarchical feature processing
  • Convolutional layers achieve these properties
    • Each output unit is a linear function of a localized subset of input units
    • Same linear transformation is applied at each location
    • Local feature detection is translation equivariant: the same feature is detected wherever it appears
  • Convolutional layers provide architectural constraints
  • The number of parameters depends on the kernel sizes, not on the size of the input
  • Inductive bias
    • Examples:
      • Architectural constraints
      • Image augmentation
      • Regularization

LeNet

Classifying digits (LeCun et al., 1989)

  • The first few layers are convolution layers, and the last few layers are fully connected layers
    • Q. Why?
      • The convolutional layers are compute heavy, but have fewer parameters
      • The fully connected layers have far more parameters, but they are cheap to compute

General idea

  • Generally speaking, we can interpret convolutional deep networks as composed of two parts: 1) a (latent) feature extractor and 2) a task head.
    • The feature extractor learns to construct powerful representations of the input that are well-suited to the task at hand.

Convolution

  • In a nutshell: point-wise multiplication and sum
$$ (\mathbf{f} \ast \mathbf{h})_i = \sum_{k \in [-w,w]} \mathbf{f}(i - k)\, \mathbf{h}(k) $$

2D convolution using kernel size 3, stride 1, and padding 1. Figure from towardsdatascience.com

Exercise: computing a 1D convolution (from scratch)

Compute $\mathbf{f} \ast \mathbf{h}$ given

$$ \mathbf{f} = \left[ \begin{array}{cccccccc} 1 & 3 & 4 & 1 & 10 & 3 & 0 & 1 \end{array} \right] $$

and

$$ \mathbf{h} = \left[ \begin{array}{ccc} 1 & 0 & -1 \end{array}\right] $$
In [2]:
import numpy as np

f = np.array([1,3,4,1,10,3,0,1])
h = np.array([1,0,-1])
width = 1
result = np.zeros(len(f) - 2*width)   # Array to store the (valid) convolution output
for i in range(len(result)):
    centre = i + width
    result[i] = np.dot(h[::-1], f[centre-width:centre+width+1]) # Note the flip
print(f'signal f:\t\t{f}')
print(f'kernel h:\t\t{h}')
print(f'convolution (f*h):\t{result}')
signal f:		[ 1  3  4  1 10  3  0  1]
kernel h:		[ 1  0 -1]
convolution (f*h):	[  3.  -2.   6.   2. -10.  -2.]
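As a sanity check, the same result can be obtained with NumPy's built-in routine, which also flips the kernel; `mode='valid'` keeps only the positions where the kernel fully overlaps the signal.

In [ ]:
print(np.convolve(f, h, mode='valid'))   # [  3  -2   6   2 -10  -2]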

Dealing with edges

  • Clipping
  • Replication
  • Symmetric padding
  • Circular padding
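These strategies map directly onto NumPy's padding modes. A quick sketch (`constant` with zeros corresponds to clipping, `edge` to replication, `symmetric` to symmetric padding, and `wrap` to circular padding):

In [ ]:
import numpy as np

f = np.array([1, 2, 3, 4])
for mode in ['constant', 'edge', 'symmetric', 'wrap']:
    print(f'{mode:>10}: {np.pad(f, 2, mode=mode)}')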

Convolutions as matrix-vector multiplication

  • Exercise: Describe 1D convolution as a matrix-vector multiplication.
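One possible construction, sketched for the `f` and `h` from the earlier exercise: each row of the matrix holds a shifted copy of the flipped kernel, so the matrix-vector product reproduces the valid convolution.

In [ ]:
import numpy as np

f = np.array([1, 3, 4, 1, 10, 3, 0, 1])
h = np.array([1, 0, -1])
n_out = len(f) - len(h) + 1

A = np.zeros((n_out, len(f)))
for i in range(n_out):
    A[i, i:i+len(h)] = h[::-1]   # flipped kernel, shifted one step right per row
print(A @ f)                     # [  3.  -2.   6.   2. -10.  -2.]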

Convolution layer

We can define the convolution layer used in deep networks as follows:

$$ \mathbf{z}_{i',j',f'} = b_{f'} + \sum_{i=1}^{H_f} \sum_{j=1}^{W_f} \sum_{f=1}^{F} \mathbf{x}_{i'+i-1,\,j'+j-1,\,f}\, \theta_{i,j,f,f'}, $$

where $H_f \times W_f$ is the kernel size, $F$ is the number of input channels, $f'$ indexes the output channels, and $b_{f'}$ is a per-channel bias. (As written, this is a cross-correlation; deep learning libraries implement exactly this and call it convolution.)
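A minimal numerical check of this definition against PyTorch (a sketch; the tensor sizes are arbitrary):

In [ ]:
import torch
import torch.nn.functional as F

H, W, Fin, Fout, Hf, Wf = 5, 5, 3, 2, 3, 3
x = torch.randn(H, W, Fin)              # x[i, j, f], channels-last as in the formula
theta = torch.randn(Hf, Wf, Fin, Fout)  # theta[i, j, f, f']
b = torch.randn(Fout)

# Direct translation of the triple sum (valid output positions only).
z = torch.zeros(H - Hf + 1, W - Wf + 1, Fout)
for ip in range(z.shape[0]):
    for jp in range(z.shape[1]):
        patch = x[ip:ip+Hf, jp:jp+Wf, :]
        z[ip, jp] = b + (patch.unsqueeze(-1) * theta).sum(dim=(0, 1, 2))

# PyTorch expects (N, C, H, W) inputs and (F', F, Hf, Wf) weights.
z_torch = F.conv2d(x.permute(2, 0, 1).unsqueeze(0),
                   theta.permute(3, 2, 0, 1), bias=b)
print(torch.allclose(z, z_torch[0].permute(1, 2, 0), atol=1e-5))   # True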

Pooling layers

  • Pooling layers are commonly used after convolutional layers
    • Decrease feature dimensions
    • Create some invariance to shifts
  • Average pooling $$ h_k(x,y,c) = \frac{1}{|\mathcal{N}(x,y)|} \sum_{(i,j) \in \mathcal{N}(x,y)} h_{k-1}(i,j,c) $$
  • Max pooling $$ h_k(x,y,c) = \max_{(i,j) \in \mathcal{N}(x,y)} h_{k-1}(i,j,c) $$
  • Springenberg et al. 2015 (ICLR workshops)

    maxpooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks
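A shape-level sketch of this substitution: both operations halve the spatial resolution, but the strided convolution has learnable parameters.

In [ ]:
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)
print(pool(x).shape)      # torch.Size([1, 64, 16, 16])
print(strided(x).shape)   # torch.Size([1, 64, 16, 16])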

Other types of convolutions

Dilated convolution

  • Atrous convolutions

Dilated convolution. 2D convolution using a $3 \times 3$ kernel with a dilation rate of 2 and no padding. Figure from towardsdatascience.com
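In PyTorch, dilation is a single constructor argument; note that it enlarges the receptive field without adding parameters. A sketch matching the figure's setting:

In [ ]:
import torch
import torch.nn as nn

x = torch.randn(1, 1, 9, 9)
conv = nn.Conv2d(1, 1, kernel_size=3, dilation=2)   # no padding, as in the figure
print(conv(x).shape)   # torch.Size([1, 1, 5, 5]); effective kernel extent is 5x5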

In [3]:
import torchvision
import torchvision.models as m
import pprint as pp
print(f'torchvision\nVERSION: {torchvision.__version__}')
print('MODELS:')
pp.pprint([x for x in dir(m) if x[0] != '_'])
torchvision
VERSION: 0.12.0
MODELS:
['AlexNet',
 'ConvNeXt',
 'DenseNet',
 'EfficientNet',
 'GoogLeNet',
 'GoogLeNetOutputs',
 'Inception3',
 'InceptionOutputs',
 'MNASNet',
 'MobileNetV2',
 'MobileNetV3',
 'RegNet',
 'ResNet',
 'ShuffleNetV2',
 'SqueezeNet',
 'VGG',
 'VisionTransformer',
 'alexnet',
 'convnext',
 'convnext_base',
 'convnext_large',
 'convnext_small',
 'convnext_tiny',
 'densenet',
 'densenet121',
 'densenet161',
 'densenet169',
 'densenet201',
 'detection',
 'efficientnet',
 'efficientnet_b0',
 'efficientnet_b1',
 'efficientnet_b2',
 'efficientnet_b3',
 'efficientnet_b4',
 'efficientnet_b5',
 'efficientnet_b6',
 'efficientnet_b7',
 'feature_extraction',
 'googlenet',
 'inception',
 'inception_v3',
 'mnasnet',
 'mnasnet0_5',
 'mnasnet0_75',
 'mnasnet1_0',
 'mnasnet1_3',
 'mobilenet',
 'mobilenet_v2',
 'mobilenet_v3_large',
 'mobilenet_v3_small',
 'mobilenetv2',
 'mobilenetv3',
 'optical_flow',
 'quantization',
 'regnet',
 'regnet_x_16gf',
 'regnet_x_1_6gf',
 'regnet_x_32gf',
 'regnet_x_3_2gf',
 'regnet_x_400mf',
 'regnet_x_800mf',
 'regnet_x_8gf',
 'regnet_y_128gf',
 'regnet_y_16gf',
 'regnet_y_1_6gf',
 'regnet_y_32gf',
 'regnet_y_3_2gf',
 'regnet_y_400mf',
 'regnet_y_800mf',
 'regnet_y_8gf',
 'resnet',
 'resnet101',
 'resnet152',
 'resnet18',
 'resnet34',
 'resnet50',
 'resnext101_32x8d',
 'resnext50_32x4d',
 'segmentation',
 'shufflenet_v2_x0_5',
 'shufflenet_v2_x1_0',
 'shufflenet_v2_x1_5',
 'shufflenet_v2_x2_0',
 'shufflenetv2',
 'squeezenet',
 'squeezenet1_0',
 'squeezenet1_1',
 'vgg',
 'vgg11',
 'vgg11_bn',
 'vgg13',
 'vgg13_bn',
 'vgg16',
 'vgg16_bn',
 'vgg19',
 'vgg19_bn',
 'video',
 'vision_transformer',
 'vit_b_16',
 'vit_b_32',
 'vit_l_16',
 'vit_l_32',
 'wide_resnet101_2',
 'wide_resnet50_2']
In [4]:
vgg16 = m.vgg16()
print(vgg16)
VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (18): ReLU(inplace=True)
    (19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (20): ReLU(inplace=True)
    (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (22): ReLU(inplace=True)
    (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (25): ReLU(inplace=True)
    (26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (27): ReLU(inplace=True)
    (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (29): ReLU(inplace=True)
    (30): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(7, 7))
  (classifier): Sequential(
    (0): Linear(in_features=25088, out_features=4096, bias=True)
    (1): ReLU(inplace=True)
    (2): Dropout(p=0.5, inplace=False)
    (3): Linear(in_features=4096, out_features=4096, bias=True)
    (4): ReLU(inplace=True)
    (5): Dropout(p=0.5, inplace=False)
    (6): Linear(in_features=4096, out_features=1000, bias=True)
  )
)

Common architectures

GoogLeNet

  • Multiple parallel feed-forward paths within each block (plus auxiliary classifiers during training)
  • Inception module
    • An inception module aims to approximate a local sparse structure in a CNN by using filters of different sizes (within the same block) whose outputs are concatenated and passed on to the next stage; a sketch follows below

Inception layer

  • Acts as a bottleneck layer
  • 1-by-1 convolutional layers are used to reduce feature channels

Inception layer (simplified). Each conv is followed by a non-linear activation.
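A sketch of such a module in PyTorch; the branch widths here are illustrative, not the paper's.

In [ ]:
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Simplified inception module: parallel 1x1, 3x3, 5x5, and pooling
    branches, with 1x1 bottlenecks keeping the channel counts in check."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c = c_out // 4
        self.b1 = nn.Conv2d(c_in, c, 1)
        self.b2 = nn.Sequential(nn.Conv2d(c_in, c, 1), nn.ReLU(),
                                nn.Conv2d(c, c, 3, padding=1))
        self.b3 = nn.Sequential(nn.Conv2d(c_in, c, 1), nn.ReLU(),
                                nn.Conv2d(c, c, 5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(c_in, c, 1))
    def forward(self, x):
        # Concatenate the branch outputs along the channel dimension.
        return torch.relu(torch.cat([self.b1(x), self.b2(x),
                                     self.b3(x), self.b4(x)], dim=1))

print(InceptionBlock(64, 128)(torch.randn(1, 64, 28, 28)).shape)   # [1, 128, 28, 28]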

ResNet

Left: the VGG-19 model (19.6 billion FLOPs) as a reference. Middle: a plain network with 34 parameter layers (3.6 billion FLOPs). Right: a residual network with 34 parameter layers (3.6 billion FLOPs). The dotted shortcuts increase dimensions. Figure from He et al. 2016.

Residual unit

  • Pass-through (shortcut) connections add the input of a layer to its output
  • Deeper models are harder to train
  • Learn a residual function rather than a direct mapping
  • Notice the loss landscape with and without residual connections

Figure taken from K. Derpanis' notes on deep learning.
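A minimal residual unit (a sketch of the identity-shortcut case, with batch normalization omitted for brevity):

In [ ]:
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
    def forward(self, x):
        # The convolutions learn the residual F(x); the shortcut adds x back.
        return torch.relu(x + self.conv2(torch.relu(self.conv1(x))))

print(ResidualUnit(64)(torch.randn(1, 64, 16, 16)).shape)   # [1, 64, 16, 16]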

Densenet

Figure from Huang et al. 2017.

  • Feature-maps learned by any of the layers can be accessed by all subsequent layers.
    • Encourages feature reuse throughout the network
    • Leads to more compact models
    • Supports diversified depth
  • Improved training
    • Individual layers get additional supervision from loss function through shorter (more direct) connections
      • Similar to DSNs (Lee et al. 2015), which attach classifiers to each hidden layer, forcing intermediate layers to learn discriminative features
    • Scales to hundreds of layers without optimization difficulties
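The mechanism in code (a sketch; real DenseNets add bottlenecks, batch norm, and transition layers): each layer consumes the concatenation of all preceding feature maps and contributes a fixed number of new channels, the growth rate.

In [ ]:
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, c_in, growth_rate, n_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(c_in + i * growth_rate, growth_rate, 3, padding=1)
            for i in range(n_layers))
    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Every layer sees all preceding feature maps.
            features.append(torch.relu(layer(torch.cat(features, dim=1))))
        return torch.cat(features, dim=1)

print(DenseBlock(16, 12, 4)(torch.randn(1, 16, 8, 8)).shape)   # [1, 64, 8, 8]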

Dense blocks and transition layers

Figure from Huang et al. 2017

Densenet vs. Resnet

Figure from Huang et al. 2017

Squeeze-and-Excitation Networks

Figure from Hu et al. 2018

SE Block

  • Squeeze operator
    • Allows global information to be used when computing channel-wise weights
  • Excitation operator
    • In early layers, the distribution of excitations is similar across classes, suggesting that feature channels are "equally important" for all classes
    • The distribution becomes class-specific in deeper layers
  • SE blocks may be used for model pruning and network compression

Squeeze-and-Excitation block (simplified).
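A sketch of the block; the reduction ratio r is a hyperparameter (the paper uses r = 16).

In [ ]:
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(),
            nn.Linear(channels // r, channels), nn.Sigmoid())
    def forward(self, x):
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))             # squeeze: global average pooling
        w = self.fc(s).view(n, c, 1, 1)    # excitation: per-channel weights
        return x * w                       # rescale each feature channel

print(SEBlock(64)(torch.randn(1, 64, 8, 8)).shape)   # [1, 64, 8, 8]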

SE Performance

Taken from Hu et al. 2018

Spatial attention

  • SE computes channel weights; we can extend the same idea to compute spatial weights, modelling a notion of spatial attention
    • The model can then pay more attention to informative spatial locations (one possible realization is sketched below)
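One possible realization (a sketch in the spirit of CBAM, Woo et al. 2018, not from the slides): summarize the channels at each location, then predict a per-pixel weight.

In [ ]:
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
    def forward(self, x):
        # Channel-wise mean and max summaries at every spatial location.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(s))   # per-pixel weights

print(SpatialAttention()(torch.randn(1, 64, 8, 8)).shape)   # [1, 64, 8, 8]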

Other notable examples

  • Iandola et al., 2016

    SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5MB Model Size

  • Howard et al., 2017

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

  • Chollet, 2017

    Xception: Deep Learning with Depthwise Separable Convolutions

Attention-based Networks

Transformers (2017)

  • Transformers use attention-based computation.
  • These models are popular in the Natural Language Processing community.
  • The GPT-3 language model also uses attention-based computation and has roughly 175 billion parameters.
  • Zheng et al., 2020

    Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Learning convolution kernels

  • Observation
    • CNNs benefit from different kernel sizes at different layers
    • Exploring all possible combinations of kernel sizes is infeasible in practice
  • Romero et al. 2022

    FlexConv: Continuous Kernel Convolutions with Differentiable Kernel Sizes

MLPs

  • Melas-Kyriazi 2021

Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet

  • Touvron et al. 2021

ResMLP: Feedforward Networks for Image Classification with Data-Efficient Training

A ConvNet for the 2020s

  • Results on ImageNet

Figure from Liu et al. 2022

Modernizing a ConvNet towards Swin (Hierarchical Vision Transformer)

Figure from Liu et al. 2022

Key ideas

  • Change stem to Patchify

Replace the ResNet-style stem cell with a patchify layer implemented using a $4 \times 4$, stride $4$ convolutional layer. Accuracy changes from $79.4\%$ to $79.5\%$.

The stem cell in standard ResNet contains a $7 \times 7$ convolution layer with stride $2$, followed by a max pool, which results in a $4 \times$ downsampling of the input images.
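Side by side, the two stems (a sketch): both downsample by $4 \times$, but the patchify stem uses non-overlapping patches. The channel widths follow ResNet-50 and ConvNeXt-T.

In [ ]:
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)

resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),   # overlapping 7x7
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)   # non-overlapping 4x4

print(resnet_stem(x).shape)     # torch.Size([1, 64, 56, 56])
print(patchify_stem(x).shape)   # torch.Size([1, 96, 56, 56])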

  • ResNeXtify
    • Grouped convolutions idea from Xie et al. 2016
      • Depthwise convolution, where the number of groups equals the number of channels. Similar to MobileNet and Xception.
        • Only mixes information in the spatial domain.

The combination of depthwise conv and $1 \times 1$ convs leads to a separation of spatial and channel mixing, a property shared by vision Transformers, where each operation either mixes information across spatial or channel dimension, but not both.
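The two pieces in code (a sketch): the depthwise convolution mixes information spatially within each channel, and the $1 \times 1$ convolution mixes across channels. The parameter comparison against a dense $3 \times 3$ convolution shows why the factorization is cheap.

In [ ]:
import torch
import torch.nn as nn

C = 64
depthwise = nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C)   # spatial mixing only
pointwise = nn.Conv2d(C, C, kernel_size=1)                        # channel mixing only
dense = nn.Conv2d(C, C, kernel_size=3, padding=1)                 # mixes both at once

x = torch.randn(1, C, 32, 32)
print(pointwise(depthwise(x)).shape)   # torch.Size([1, 64, 32, 32])

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(depthwise) + count(pointwise), 'vs', count(dense))    # 4800 vs 36928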

  • Inverted bottleneck

One important design in every Transformer block is that it creates an inverted bottleneck, i.e., the hidden dimension of the MLP block is four times wider than the input dimension.

Figure from Liu et al. 2022

  • Large kernel sizes

One of the most distinguishing aspects of vision Transformers is their non-local self-attention, which enables each layer to have a global receptive field.

To explore large kernels, one prerequisite is to move up the position of the depthwise conv layer.

  • Replacing ReLU with GELU
  • Use fewer activation functions
  • Use fewer normalization layers
    • Replace batch normalization with layer normalization

Figure from Liu et al. 2022
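Putting these changes together, a minimal ConvNeXt-style block might look as follows (a sketch; the actual implementation also uses layer scale and stochastic depth):

In [ ]:
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)             # single normalization per block
        self.pwconv1 = nn.Linear(dim, 4 * dim)    # inverted bottleneck (4x wider)
        self.act = nn.GELU()                      # single activation per block
        self.pwconv2 = nn.Linear(4 * dim, dim)
    def forward(self, x):
        residual = x
        x = self.dwconv(x)                        # large-kernel depthwise conv first
        x = x.permute(0, 2, 3, 1)                 # (N, C, H, W) -> (N, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return residual + x.permute(0, 3, 1, 2)

print(ConvNeXtBlock(96)(torch.randn(1, 96, 56, 56)).shape)   # [1, 96, 56, 56]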

Aside: Normalization techniques

Wu and He, 2018 (Group Normalization)

Figure from Wu and He 2018
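The four techniques differ mainly in which axes the mean and variance are computed over. A shape-level sketch for an (N, C, H, W) tensor (the group count of 4 is illustrative):

In [ ]:
import torch
import torch.nn as nn

x = torch.randn(8, 32, 16, 16)               # (N, C, H, W)
print(nn.BatchNorm2d(32)(x).shape)           # stats over (N, H, W), per channel
print(nn.LayerNorm([32, 16, 16])(x).shape)   # stats over (C, H, W), per sample
print(nn.InstanceNorm2d(32)(x).shape)        # stats over (H, W), per sample and channel
print(nn.GroupNorm(4, 32)(x).shape)          # stats over (H, W) within channel groups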

Is object recognition solved?

  • Barbu et al. 2019

ObjectNet: A Large-Scale Bias-Controlled Dataset for Pushing the Limits of Object Recognition Models

Figure taken from objectnet.dev

  • Performance on ObjectNet benchmark
    • Models suffer a 40 to 45% drop in performance relative to their ImageNet accuracy

Afterword

  • Adapted from Jeff Hawkins, Founder of Palm Computing.

    The key to object recognition is representation.

  • Convolutional neural networks are particularly well-suited for computer vision tasks
  • Convolutional layers "mimic" processing in the visual cortex
  • They exploit the spatial relationships between neighbouring pixels
  • They learn powerful representations that reduce the semantic gap

Practical matters: where to go from here?

  • Deep learning is as much about engineering as it is about science
  • Learn one of the deep learning frameworks
    • Become an efficient coder
  • Don't be afraid to use high-level deep learning tools to quickly prototype baselines (e.g., Hugging Face)
    • Deep learning projects share common features
      • Data loaders
      • Measuring performance (e.g., accuracy, precision)