Faisal Qureshi
http://www.vclab.ca
Task: We want to classify the following image
import cv2
import matplotlib.pyplot as plt
img = cv2.imread('./convnet/1.jpeg', 0)  # the 0 flag reads the image as grayscale
plt.figure(figsize=(10,8))
plt.imshow(img, cmap='gray')
plt.xlabel('width')
plt.xticks([])
plt.ylabel('height')
plt.yticks([])
plt.title(f'Image dimensions: {img.shape[0]} x {img.shape[1]}. #pixels: {img.shape[0] * img.shape[1]}');
2D convolution using kernel size 3, stride 1, and padding 1. Figure from towardsdatascience.com
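A quick sanity check (a sketch added for illustration, not part of the original figure): with a 3x3 kernel, stride 1, and padding 1, the spatial size of the output matches the input.
import torch
import torch.nn as nn
x = torch.randn(1, 1, 5, 5)                                 # a single 5x5, single-channel input
conv = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1)  # same settings as in the figure
print(x.shape, '->', conv(x).shape)                         # 5x5 in, 5x5 out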
Compute $\mathbf{f} \ast \mathbf{h}$ given
$$ \mathbf{f} = \left[ \begin{array}{cccccccc} 1 & 3 & 4 & 1 & 10 & 3 & 0 & 1 \end{array} \right] $$
and
$$ \mathbf{h} = \left[ \begin{array}{ccc} 1 & 0 & -1 \end{array}\right] $$
import numpy as np
f = np.array([1,3,4,1,10,3,0,1])
h = np.array([1,0,-1])
width = 1
result = np.zeros(len(f) - 2*width)  # 'valid' convolution: the output is shorter than f by 2*width
for i in range(len(result)):
    centre = i + width
    result[i] = np.dot(h[::-1], f[centre-width:centre+width+1])  # flip the kernel, then take a dot product with the window
print(f'signal f:\t\t{f}')
print(f'kernel h:\t\t{h}')
print(f'convolution (f*h):\t{result}')
We can define the convolution layer used in deep networks as follows:
$$ \mathbf{z}_{i',j',f'} = b_{f'} + \sum_{i=1}^{H_f} \sum_{j=1}^{W_f} \sum_{f=1}^{F} \mathbf{x}_{i'+i-1,j'+j-1,f} \theta_{ijff'}. $$
Max pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks.
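A minimal sketch of this equation written as explicit loops (the array shapes and variable names below are illustrative assumptions, not taken from the notes):
import numpy as np
H, W, F = 6, 6, 3        # input height, width, and number of channels
Hf, Wf, Fp = 3, 3, 4     # kernel height, kernel width, and number of output channels (filters)
x = np.random.randn(H, W, F)
theta = np.random.randn(Hf, Wf, F, Fp)
b = np.random.randn(Fp)
z = np.zeros((H - Hf + 1, W - Wf + 1, Fp))   # no padding, stride 1
for ip in range(z.shape[0]):                 # i' in the equation
    for jp in range(z.shape[1]):             # j'
        for fp in range(Fp):                 # f'
            z[ip, jp, fp] = b[fp] + np.sum(x[ip:ip+Hf, jp:jp+Wf, :] * theta[:, :, :, fp])
print(z.shape)   # (4, 4, 4)
Stepping the two outer loops by more than one is exactly the strided convolution that can take over the downsampling role of max pooling.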
Dilated convolution. 2D convolution using a $3 \times 3$ kernel with a dilation rate of 2 and no padding. Figure from towardsdatascience.com
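A short sketch (added for illustration) of the same operation in PyTorch: dilation 2 spreads the 3x3 kernel over a 5x5 receptive field, so a 7x7 input with no padding yields a 3x3 output.
import torch
import torch.nn as nn
x = torch.randn(1, 1, 7, 7)
conv = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=0)
print(conv(x).shape)   # torch.Size([1, 1, 3, 3])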
import torchvision
import torchvision.models as m
import pprint as pp
print(f'torchvision\nVERSION: {torchvision.__version__}')
print('MODELS:')
pp.pprint([x for x in dir(m) if x[0] != '_'])
vgg16 = m.vgg16()  # construct a VGG-16 model (weights are randomly initialized by default)
print(vgg16)
Going Deeper with Convolutions
Inception layer (simplified). Each conv is followed by a non-linear activation.
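A hedged PyTorch sketch of such a block (the class name and channel counts are my own placeholders, not from the paper): four parallel branches whose outputs are concatenated along the channel dimension.
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Simplified Inception layer: 1x1, 3x3, 5x5, and pooled branches, each conv followed by ReLU."""
    def __init__(self, c_in, c_branch):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(c_in, c_branch, 1), nn.ReLU())
        self.b3 = nn.Sequential(nn.Conv2d(c_in, c_branch, 1), nn.ReLU(),
                                nn.Conv2d(c_branch, c_branch, 3, padding=1), nn.ReLU())
        self.b5 = nn.Sequential(nn.Conv2d(c_in, c_branch, 1), nn.ReLU(),
                                nn.Conv2d(c_branch, c_branch, 5, padding=2), nn.ReLU())
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(c_in, c_branch, 1), nn.ReLU())

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

print(InceptionBlock(64, 32)(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 128, 32, 32])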
Deep Residual Learning for Image Recognition
Left: the VGG-19 model (19.6 billion FLOPs) as a reference. Middle: a plain network with 34 parameter layers (3.6 billion FLOPs). Right: a residual network with 34 parameter layers (3.6 billion FLOPs). The dotted shortcuts increase dimensions. Figure from He et al. 2016.
Figure taken from K. Derpanis' notes on deep learning.
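A minimal sketch of the basic residual block used in these networks (layer sizes are illustrative): two 3x3 convolutions plus an identity shortcut, so the block computes $\mathbf{y} = F(\mathbf{x}) + \mathbf{x}$.
import torch
import torch.nn as nn

class BasicResBlock(nn.Module):
    """Basic residual block: conv-BN-ReLU-conv-BN, then add the input back and apply ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # the shortcut connection

print(BasicResBlock(64)(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])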
Densely Connected Convolutional Networks
Figure from Huang et al. 2017.
Figure from Huang et al. 2017
Figure from Huang et al. 2017
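A rough sketch of a dense block (class name and sizes are placeholders): each layer receives the concatenation of all preceding feature maps and contributes a fixed number of new channels (the growth rate).
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Simplified dense block: layer i sees c_in + i*growth channels and produces growth new ones."""
    def __init__(self, c_in, growth, n_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.BatchNorm2d(c_in + i * growth), nn.ReLU(),
                          nn.Conv2d(c_in + i * growth, growth, 3, padding=1))
            for i in range(n_layers)])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

print(DenseBlock(16, growth=12, n_layers=4)(torch.randn(1, 16, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])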
Squeeze-and-Excitation Networks
Figure from Hu et al. 2018
Squeeze-and-Excitation block (simplified).
Taken from Hu et al. 2018
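A minimal sketch of the block (reduction ratio and names are illustrative): squeeze with global average pooling, excite with two fully connected layers, then rescale each channel of the input.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: pool to per-channel statistics, learn channel weights, rescale."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        s = x.mean(dim=(2, 3))                                  # squeeze: (N, C)
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))    # excitation: per-channel weights in (0, 1)
        return x * w[:, :, None, None]                          # rescale the channels of x

print(SEBlock(64)(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])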
FractalNet: Ultra-Deep Neural Networks without Residuals
SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5MB Model Size
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Aggregated Residual Transformations for Deep Neural Networks
Deep Pyramidal Residual Networks
Xception: Deep Learning with Depthwise Separable Convolutions
Attention Is All You Need
Exploring Self-attention for Image Recognition
End-to-End Object Detection with Transformers
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
FlexConv: Continuous Kernel Convolutions with Differentiable Kernel Sizes
Learning Strides in Convolutional Neural Networks
MLP-Mixer: An all-MLP Architecture for Vision
Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet
ResMLP: Feedforward Networks for Image Classification with Data-Efficient Training
A ConvNet for the 2020s
Figure from Liu et al. 2022
Figure from Liu et al. 2022
Replace the ResNet-style stem cell with a patchify layer implemented using a $4 \times 4$, stride $4$ convolutional layer. The accuracy has changed from $79.4\%$ to $79.5\%$.
The stem cell in standard ResNet contains a $7 \times 7$ convolution layer with stride $2$, followed by a max pool, which results in a $4 \times$ downsampling of the input images.
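A quick comparison of the two stems (a sketch; channel counts follow common choices, not necessarily the paper's exact configuration):
import torch
import torch.nn as nn
x = torch.randn(1, 3, 224, 224)
resnet_stem = nn.Sequential(nn.Conv2d(3, 64, 7, stride=2, padding=3),   # 7x7 conv, stride 2
                            nn.MaxPool2d(3, stride=2, padding=1))       # followed by a stride-2 max pool
patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)               # non-overlapping 4x4 patches
print(resnet_stem(x).shape)    # torch.Size([1, 64, 56, 56])  -> 4x downsampling
print(patchify_stem(x).shape)  # torch.Size([1, 96, 56, 56])  -> also 4x downsampling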
The combination of depthwise conv and $1 \times 1$ convs leads to a separation of spatial and channel mixing, a property shared by vision Transformers, where each operation either mixes information across spatial or channel dimension, but not both.
One important design in every Transformer block is that it creates an inverted bottleneck, i.e., the hidden dimension of the MLP block is four times wider than the input dimension.
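A hedged sketch combining these two ideas (normalization is omitted for brevity, and the class name is my own): a depthwise convolution mixes information spatially within each channel, and two 1x1 convolutions mix channels through an inverted bottleneck whose hidden width is four times the input width.
import torch
import torch.nn as nn

class InvertedBottleneckBlock(nn.Module):
    """Depthwise conv (spatial mixing) followed by 1x1 expand/project convs (channel mixing)."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise: one filter per channel
        self.pw1 = nn.Conv2d(dim, 4 * dim, kernel_size=1)                        # expand channels 4x
        self.pw2 = nn.Conv2d(4 * dim, dim, kernel_size=1)                        # project back
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.pw2(self.act(self.pw1(self.dwconv(x))))                  # residual connection

print(InvertedBottleneckBlock(96)(torch.randn(1, 96, 56, 56)).shape)   # torch.Size([1, 96, 56, 56])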
Figure from Liu et al. 2022
One of the most distinguishing aspects of vision Transformers is their non-local self-attention, which enables each layer to have a global receptive field.
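A minimal sketch of why self-attention is global (shapes are illustrative): every output token is a weighted combination of all input tokens, so a single layer can relate any two positions.
import torch
N, d = 196, 64                                    # e.g., 14x14 patch tokens with 64-dimensional embeddings
x = torch.randn(N, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = torch.softmax(q @ k.T / d**0.5, dim=-1)    # (N, N) attention weights over all tokens
out = attn @ v                                    # each row mixes information from every token
print(out.shape)   # torch.Size([196, 64])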
To explore large kernels, one prerequisite is to move up the position of the depthwise conv layer.
Figure from Liu et al. 2022
ObjectNet: A Large-Scale Bias-Controlled Dataset for Pushing the Limits of Object Recognition Models
Figure taken from objectnet.dev
The key to object recognition is representation.