Faisal Z Qureshi
faisal.qureshi@ontariotechu.ca
http://www.vclab.ca
These slides draw heavily upon the work of many individuals, most notably:
$\newcommand{\bh}{\mathbf{h}}$
[From A. Karpathy Blog]
$\newcommand{\bx}{\mathbf{x}}$
$$ \begin{split} \bh_t &= \tanh (W_{hh} \bh_{t-1} + W_{xh} \bx_t) \\ \hat{y}_t &= \text{softmax} (W_{hy} \bh_{t}) \end{split} $$

where the subscript $t$ indicates the sequence index: $\bh_t$ is the hidden state, $\bx_t$ the input, and $\hat{y}_t$ the predicted output at step $t$.
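As a concrete illustration, here is a minimal NumPy sketch of one forward step of this recurrence (the function name, dimensions, and random initialization are assumptions for illustration, not from the slides):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One step of the vanilla RNN above: update h, then predict y."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)    # hidden-state recurrence
    scores = W_hy @ h_t
    y_hat = np.exp(scores - scores.max())        # numerically stable softmax
    y_hat /= y_hat.sum()
    return h_t, y_hat

# Assumed toy sizes: 10-d input, 32-d hidden state, 5 output classes.
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(32, 10))
W_hh = rng.normal(scale=0.1, size=(32, 32))
W_hy = rng.normal(scale=0.1, size=(5, 32))
h, y_hat = rnn_step(rng.normal(size=10), np.zeros(32), W_xh, W_hh, W_hy)
```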
When dealing with output sequences, we can define the loss as a function of the predicted output $\hat{y}_t$ and the expected value $y_t$, summed over time steps $t$:

$$ E(y,\hat{y}) = \sum_t E_t(y_t, \hat{y}_t) $$

We need to compute $\frac{\partial E}{\partial W_{xh}}$, $\frac{\partial E}{\partial W_{hh}}$ and $\frac{\partial E}{\partial W_{hy}}$ in order to train an RNN.
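For instance, taking the per-step loss $E_t$ to be cross-entropy (an assumed but common choice; the slides leave $E_t$ generic), the total loss is a plain sum over time:

```python
import numpy as np

def sequence_loss(y_hats, ys):
    """E = sum_t E_t, with E_t taken to be cross-entropy at step t
    (an assumed, common choice; the slides leave E_t generic).

    y_hats: per-step predicted distributions; ys: per-step class labels."""
    return sum(-np.log(y_hat[y] + 1e-12) for y_hat, y in zip(y_hats, ys))
```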
We can compute the highlighted term in the following expression using the chain rule:

$$ \frac{\partial E_3}{\partial W_{hh}} = \sum_{k=0}^3 \frac{\partial E_3}{\partial \hat{y}_3} \frac{\partial \hat{y}_3}{\partial \bh_3} \color{red}{\frac{\partial \bh_3}{\partial \bh_k}} \frac{\partial \bh_k}{\partial W_{hh}} $$

Applying the chain rule again:

$$ \frac{\partial \bh_3}{\partial \bh_k} = \prod_{j=k+1}^3 \frac{\partial \bh_j}{\partial \bh_{j-1}} $$

Or, more generally:

$$ \frac{\partial \bh_t}{\partial \bh_k} = \prod_{j=k+1}^t \frac{\partial \bh_j}{\partial \bh_{j-1}} $$
Each factor $\frac{\partial \bh_j}{\partial \bh_{j-1}}$ is a Jacobian matrix.
For longer sequences, this product contains many Jacobian factors; its norm tends to shrink towards zero (vanishing gradients) or grow without bound (exploding gradients), which makes learning long-range dependencies difficult.
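A quick numerical illustration of why this happens, in a toy setup with assumed sizes: for the tanh RNN each factor is $\frac{\partial \bh_j}{\partial \bh_{j-1}} = \mathrm{diag}(1-\bh_j \circ \bh_j)\,W_{hh}$, so the product behaves at best like powers of $W_{hh}$, whose norm decays or blows up geometrically with the number of steps:

```python
import numpy as np

# Toy demo (assumed sizes). Every entry of 1 - tanh(.)**2 lies in (0, 1],
# so the diag factors can only shrink the product further; the spectral
# norm of W_hh**t captures the geometric trend.
rng = np.random.default_rng(0)
n = 32
for scale in (0.1, 0.5):                       # small vs large recurrent weights
    W_hh = rng.normal(scale=scale, size=(n, n))
    for t in (1, 10, 50):
        norm = np.linalg.norm(np.linalg.matrix_power(W_hh, t), 2)
        print(f"scale={scale}  t={t:2d}  ||W_hh^t||_2 = {norm:.3e}")
# The small-scale run illustrates vanishing gradients; the large-scale
# run illustrates exploding gradients.
```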
For the image captioning example shown in the previous slide, $\bh_t$ is defined as follows:
$$ \begin{split} \bh_t &= \tanh (W_{hh} \bh_{t-1} + W_{xh} \bx_t + \color{red}{W_{ih} \mathbf{v}}) \\ \hat{y}_t &= \text{softmax} (W_{hy} \bh_{t}) \end{split} $$

Here $\mathbf{v}$ is the image feature vector; it conditions the recurrence through the extra term $W_{ih} \mathbf{v}$.
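A minimal NumPy sketch of this step, following the formula above (function and parameter names are assumptions):

```python
import numpy as np

def caption_rnn_step(x_t, h_prev, v, W_xh, W_hh, W_ih, W_hy):
    """One captioning step: the image feature v enters the hidden-state
    update through the extra W_ih @ v term, as in the formula above."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + W_ih @ v)
    scores = W_hy @ h_t                      # unnormalized word scores
    y_hat = np.exp(scores - scores.max())    # softmax over the vocabulary
    y_hat /= y_hat.sum()
    return h_t, y_hat
```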
Change of notation: in what follows we drop boldface and write the cell state and hidden state as $c_t$ and $h_t$.

$$ \begin{split} c_t &= \theta \, c_{t-1} + \theta_g \, g_t \\ h_t &= \tanh(c_t) \end{split} $$

Here the gating terms $\theta$ and $\theta_g$ control how much of the previous cell state $c_{t-1}$ is retained and how much of the new candidate $g_t$ is written.

$\newcommand{\bi}{\mathbf{i}}$ $\newcommand{\bb}{\mathbf{b}}$ $\newcommand{\bc}{\mathbf{c}}$ $\newcommand{\bo}{\mathbf{o}}$ $\newcommand{\boldf}{\mathbf{f}}$
$\circ$ represents element-wise (Hadamard) multiplication.
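Putting the pieces together, here is a minimal NumPy sketch of one step of a full LSTM cell with input, forget, and output gates $\bi$, $\boldf$, $\bo$ and candidate $\mathbf{g}$. The stacked-weight layout and names are assumptions; the slide's simplified cell corresponds to $\theta$ playing the role of the forget gate and $\theta_g$ the input gate, with the output gate fixed to 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps the concatenation [h_prev; x_t] to the four
    stacked gate pre-activations (an assumed layout convention)."""
    n = h_prev.size
    z = W @ np.concatenate([h_prev, x_t]) + b
    i = sigmoid(z[0*n:1*n])        # input gate
    f = sigmoid(z[1*n:2*n])        # forget gate
    o = sigmoid(z[2*n:3*n])        # output gate
    g = np.tanh(z[3*n:4*n])        # candidate cell update
    c_t = f * c_prev + i * g       # '*' is the element-wise product (the slides' circ)
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```

Note the additive cell-state update $\bc_t = \boldf \circ \bc_{t-1} + \bi \circ \mathbf{g}_t$: gradients flowing through the cell state bypass the repeated multiplications by $W_{hh}$ that make the plain RNN's Jacobian product vanish or explode.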
Check out the video at https://imgur.com/gallery/vaNahKE