Lindholm et al., Chapter 6

Neural Networks and Deep Learning

The Neural Network Model

$$\hat y = f_\theta(x_1, \dots, x_p) \tag{6.1}$$

Generalised Linear Regression

$$\hat y = W_1 x_1 + W_2 x_2 + \cdots + W_p x_p + b \tag{6.2}$$
Figure 1 - (a) Graphical illustration of the linear regression model and (b) the generalised linear regression model. In (a), the output is the sum of the offset term b and the weighted inputs. In (b), that sum is additionally passed through a non-linear activation function h.
$$\hat y = h(W_1 x_1 + W_2 x_2 + \cdots + W_p x_p + b) \tag{6.3}$$
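As a concrete illustration of (6.2)-(6.3), the model is a weighted sum of the inputs, optionally passed through a non-linearity. A minimal NumPy sketch; the numbers and the choice of a sigmoid for h are illustrative assumptions, not fixed by the text:

```python
import numpy as np

def sigmoid(z):
    # One possible choice of the non-linear activation h
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])      # inputs x_1, ..., x_p (p = 3)
W = np.array([0.1, 0.4, -0.3])      # weights W_1, ..., W_p
b = 0.2                             # offset term

y_hat_linear = W @ x + b            # linear regression, equation (6.2)
y_hat = sigmoid(W @ x + b)          # generalised linear regression, equation (6.3)
print(y_hat_linear, y_hat)
```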

Two-Layer Network

$$q_k = h(W_{k1} x_1 + W_{k2} x_2 + \cdots + W_{kp} x_p + b_k), \qquad k = 1, \dots, U \tag{6.4}$$
$$\hat y = W_1 q_1 + W_2 q_2 + \cdots + W_U q_U + b \tag{6.5}$$
$$\begin{align*}
q_1 &= h(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + \cdots + W_{1p}^{(1)} x_p + b_1^{(1)}),\\
q_2 &= h(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + \cdots + W_{2p}^{(1)} x_p + b_2^{(1)}),\\
&\ \ \vdots\\
q_U &= h(W_{U1}^{(1)} x_1 + W_{U2}^{(1)} x_2 + \cdots + W_{Up}^{(1)} x_p + b_U^{(1)}),
\end{align*} \tag{6.6a}$$
$$\hat y = W_1^{(2)} q_1 + W_2^{(2)} q_2 + \cdots + W_U^{(2)} q_U + b^{(2)} \tag{6.6b}$$
$$\begin{align*}
{\bf W}^{(1)} &= \begin{bmatrix} W_{11}^{(1)} & \cdots & W_{1p}^{(1)}\\ \vdots & & \vdots\\ W_{U1}^{(1)} & \cdots & W_{Up}^{(1)} \end{bmatrix}, \ \ \ \ && {\bf b}^{(1)} = \begin{bmatrix} b_1^{(1)}\\ \vdots\\ b_U^{(1)} \end{bmatrix},\\
{\bf W}^{(2)} &= \begin{bmatrix} W_1^{(2)} & \cdots & W_U^{(2)} \end{bmatrix}, && {\bf b}^{(2)} = \begin{bmatrix} b^{(2)} \end{bmatrix},
\end{align*} \tag{6.7}$$
$$\begin{align}
{\bf q} &= h({\bf W}^{(1)} x + {\bf b}^{(1)}), \tag{6.8a}\\
\hat y &= {\bf W}^{(2)} {\bf q} + {\bf b}^{(2)}, \tag{6.8b}
\end{align}$$
$$\theta = \begin{bmatrix} \text{vec}({\bf W}^{(1)})^T & \text{vec}({\bf b}^{(1)})^T & \text{vec}({\bf W}^{(2)})^T & \text{vec}({\bf b}^{(2)})^T \end{bmatrix}$$
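In the stacked notation of (6.7), the two-layer model (6.8) is two matrix-vector products with an elementwise non-linearity in between. A small sketch; the sizes (p = 3, U = 4), the random initialisation, and the tanh activation are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
p, U = 3, 4                          # number of inputs and hidden units

W1 = rng.normal(size=(U, p))         # W^(1), shape U x p
b1 = np.zeros(U)                     # b^(1)
W2 = rng.normal(size=(1, U))         # W^(2), shape 1 x U
b2 = np.zeros(1)                     # b^(2)

x = rng.normal(size=p)
q = np.tanh(W1 @ x + b1)             # hidden units, equation (6.8a)
y_hat = W2 @ q + b2                  # output, equation (6.8b)
print(y_hat)
```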

Deep Neural Networks

$${\bf q}^{(l)} = h({\bf W}^{(l)} {\bf q}^{(l-1)} + {\bf b}^{(l)}) \tag{6.10}$$
$$\begin{align*}
{\bf q}^{(1)} &= h({\bf W}^{(1)} x + {\bf b}^{(1)}),\\
{\bf q}^{(2)} &= h({\bf W}^{(2)} {\bf q}^{(1)} + {\bf b}^{(2)}),\\
&\ \ \vdots\\
{\bf q}^{(L-1)} &= h({\bf W}^{(L-1)} {\bf q}^{(L-2)} + {\bf b}^{(L-1)}),\\
\hat y &= {\bf W}^{(L)} {\bf q}^{(L-1)} + {\bf b}^{(L)}
\end{align*} \tag{6.11}$$
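Since every layer has the same form (6.10), the full forward pass (6.11) is a loop over the layers, with the non-linearity skipped at the output. A sketch under the assumption of ReLU hidden units and arbitrary layer widths:

```python
import numpy as np

def forward(x, weights, biases):
    """Forward pass through an L-layer network, following (6.10)-(6.11)."""
    q = x
    for l, (W, b) in enumerate(zip(weights, biases), start=1):
        z = W @ q + b
        # Hidden layers apply the activation h; the last layer stays linear
        q = z if l == len(weights) else np.maximum(z, 0.0)   # ReLU as h
    return q

rng = np.random.default_rng(1)
widths = [3, 5, 4, 1]                # p = 3 inputs, two hidden layers, one output
weights = [rng.normal(size=(widths[l + 1], widths[l])) for l in range(len(widths) - 1)]
biases = [np.zeros(widths[l + 1]) for l in range(len(widths) - 1)]
print(forward(rng.normal(size=3), weights, biases))
```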

Neural Networks for Classification

$$\text{softmax}({\bf z}) \triangleq \frac{1}{\sum_{j=1}^M e^{z_j}} \begin{bmatrix} e^{z_1}\\ e^{z_2}\\ \vdots\\ e^{z_M} \end{bmatrix} \tag{6.15}$$
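A direct implementation of (6.15); subtracting the maximum of z before exponentiating is a standard numerical-stability trick that leaves the result unchanged, since the shift cancels in the ratio:

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) avoids overflow in exp without changing the output
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, -1.0])
print(softmax(z), softmax(z).sum())   # components are positive and sum to 1
```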

Training a Neural Network

$$\theta = \begin{bmatrix} \text{vec}({\bf W}^{(1)})^T & \text{vec}({\bf b}^{(1)})^T & \text{vec}({\bf W}^{(2)})^T & \text{vec}({\bf b}^{(2)})^T \end{bmatrix}$$
$$\hat\theta = \arg\min_\theta J(\theta), \qquad \text{where } J(\theta) = \frac{1}{n} \sum_{i=1}^n L({\bf x}_i, {\bf y}_i, \theta) \tag{6.18}$$
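Minimising (6.18) is typically done with gradient-based optimisation. The sketch below shows only the generic descent loop; `grad_J` stands for whatever routine returns ∇θJ(θ) (for neural networks, the backpropagation algorithm of the next section), and the step size and iteration count are arbitrary assumptions:

```python
import numpy as np

def gradient_descent(theta0, grad_J, gamma=0.1, num_iters=100):
    """Generic minimisation of J(theta) given a gradient routine grad_J."""
    theta = theta0.copy()
    for _ in range(num_iters):
        theta -= gamma * grad_J(theta)   # theta_{t+1} = theta_t - gamma * grad J(theta_t)
    return theta

# Toy example: J(theta) = ||theta||^2 has gradient 2*theta and minimiser 0
print(gradient_descent(np.array([3.0, -2.0]), lambda th: 2 * th))
```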

Backpropagation

$$d{\bf W}^{(l)} \triangleq \nabla_{{\bf W}^{(l)}} J(\theta) = \begin{bmatrix}
\frac{\partial J(\theta)}{\partial W_{11}^{(l)}} & \cdots & \frac{\partial J(\theta)}{\partial W_{1,U^{(l-1)}}^{(l)}}\\
\vdots & & \vdots\\
\frac{\partial J(\theta)}{\partial W_{U^{(l)},1}^{(l)}} & \cdots & \frac{\partial J(\theta)}{\partial W_{U^{(l)},U^{(l-1)}}^{(l)}}
\end{bmatrix} \tag{6.22a}$$
$$d{\bf b}^{(l)} \triangleq \nabla_{{\bf b}^{(l)}} J(\theta) = \begin{bmatrix}
\frac{\partial J(\theta)}{\partial b_1^{(l)}}\\
\vdots\\
\frac{\partial J(\theta)}{\partial b_{U^{(l)}}^{(l)}}
\end{bmatrix} \tag{6.22b}$$
$$\begin{align}
{\bf W}_{t+1}^{(l)} &\leftarrow {\bf W}_t^{(l)} - \gamma \, d{\bf W}_t^{(l)}, \tag{6.23a}\\
{\bf b}_{t+1}^{(l)} &\leftarrow {\bf b}_t^{(l)} - \gamma \, d{\bf b}_t^{(l)}, \tag{6.23b}
\end{align}$$
$$d{\bf q}^{(l)} \triangleq \nabla_{{\bf q}^{(l)}} J(\theta) = \begin{bmatrix}
\frac{\partial J(\theta)}{\partial q_1^{(l)}}\\
\vdots\\
\frac{\partial J(\theta)}{\partial q_{U^{(l)}}^{(l)}}
\end{bmatrix} \tag{6.25b}$$
$$d{\bf z}^{(l)} = d{\bf q}^{(l)} \odot h'({\bf z}^{(l)}) \tag{6.27a}$$
$$d{\bf q}^{(l-1)} = {\bf W}^{(l)T} d{\bf z}^{(l)} \tag{6.27b}$$
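Equations (6.27a)-(6.27b) are the per-layer backward step: scale dq^(l) elementwise by h'(z^(l)), then map back through W^(l)T to reach the previous layer's hidden units. A one-layer sketch assuming h = tanh (so h'(z) = 1 − tanh²(z)); the array names and sizes are made up for illustration:

```python
import numpy as np

def backward_one_layer(dq_l, z_l, W_l):
    """Given dq^(l), z^(l) and W^(l), return (dz^(l), dq^(l-1)) as in (6.27)."""
    dz_l = dq_l * (1.0 - np.tanh(z_l) ** 2)   # (6.27a), with h = tanh
    dq_prev = W_l.T @ dz_l                    # (6.27b)
    return dz_l, dq_prev

rng = np.random.default_rng(2)
W_l = rng.normal(size=(4, 3))                 # layer with 3 inputs and 4 units
dz, dq_prev = backward_one_layer(rng.normal(size=4), rng.normal(size=4), W_l)
print(dz.shape, dq_prev.shape)                # (4,) and (3,)
```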

Figure 5 - Graphical representation of the backpropagation algorithm

Input: Current parameters ${\bf W}^{(l)}, {\bf b}^{(l)}$ for $l = 1, \dots, L$ and a mini-batch of $n_b$ data points $({\bf X}, {\bf y})$

  1. Forward Propagation
  2. Set ${\bf Q}^{(0)} \leftarrow {\bf X}$
  3. for $l = 1, \dots, L$ do
  4.  |   ${\bf Z}^{(l)} = {\bf Q}^{(l-1)} {\bf W}^{(l)T} + {\bf b}^{(l)T}$
  5.  |   ${\bf Q}^{(l)} = h({\bf Z}^{(l)})$        Do not execute this line for the last layer $l = L$
  6. end
  7. Evaluate the cost function
  8. if Regression problem then
  9.  |   $J(\theta) = \frac{1}{n_b} \sum_{i=1}^{n_b} (y_i - Z_i^{(L)})^2$
  10.  |   $d{\bf Z}^{(L)} = -2({\bf y} - {\bf Z}^{(L)})$
  11. else if Classification problem then
  12.  |   $J(\theta) = \frac{1}{n_b} \sum_{i=1}^{n_b} \left(-Z_{i,y_i}^{(L)} + \ln \sum_{j=1}^{M} \exp(Z_{ij}^{(L)})\right)$
  13.  |   $dZ_{ij}^{(L)} = -\mathbb{I}\lbrace y_i = j\rbrace + \frac{\exp(Z_{ij}^{(L)})}{\sum_{k=1}^{M} \exp(Z_{ik}^{(L)})} \ \ \ \ \forall i,j$
  14. Backward Propagation
  15. for $l = L, \dots, 1$ do
  16.  |   $d{\bf Z}^{(l)} = d{\bf Q}^{(l)} \odot h'({\bf Z}^{(l)})$       Do not execute this line for the last layer $l = L$
  17.  |   $d{\bf Q}^{(l-1)} = d{\bf Z}^{(l)} {\bf W}^{(l)}$
  18.  |   $d{\bf W}^{(l)} = \frac{1}{n_b} d{\bf Z}^{(l)T} {\bf Q}^{(l-1)}$
  19.  |   $db_j^{(l)} = \frac{1}{n_b} \sum_{i=1}^{n_b} dZ_{ij}^{(l)} \ \ \ \ \ \ \forall j$
  20. end
  21. $\nabla_\theta J(\theta) = \begin{bmatrix} \text{vec}(d{\bf W}^{(1)})^T & d{\bf b}^{(1)T} & \cdots & \text{vec}(d{\bf W}^{(L)})^T & d{\bf b}^{(L)T} \end{bmatrix}$
  22. return $J(\theta), \nabla_\theta J(\theta)$
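For the regression branch, the pseudocode above translates fairly directly into NumPy, with rows of X and Q^(l) holding individual data points so that Z^(l) = Q^(l−1) W^(l)T + b^(l)T. The following is a sketch under those assumptions (tanh hidden units, a single linear output), not a reference implementation:

```python
import numpy as np

def backprop_regression(X, y, weights, biases,
                        h=np.tanh, dh=lambda z: 1 - np.tanh(z) ** 2):
    """Cost J(theta) and gradients dW^(l), db^(l); defaults assume h = tanh."""
    n_b, L = X.shape[0], len(weights)

    # Forward propagation: rows of Q[l] hold q^(l) for each data point
    Q, Z = [X], [None]
    for l in range(1, L + 1):
        Zl = Q[l - 1] @ weights[l - 1].T + biases[l - 1]   # Z^(l) = Q^(l-1) W^(l)T + b^(l)T
        Z.append(Zl)
        Q.append(Zl if l == L else h(Zl))                  # last layer stays linear

    # Cost function and its gradient with respect to the output layer
    J = np.mean((y - Z[L][:, 0]) ** 2)
    dZ = -2 * (y - Z[L][:, 0])[:, None]

    # Backward propagation
    dWs, dbs = [None] * L, [None] * L
    for l in range(L, 0, -1):
        if l < L:
            dZ = dQ * dh(Z[l])                             # skip for the last layer
        dQ = dZ @ weights[l - 1]                           # dQ^(l-1)
        dWs[l - 1] = dZ.T @ Q[l - 1] / n_b
        dbs[l - 1] = dZ.mean(axis=0)
    return J, dWs, dbs

rng = np.random.default_rng(3)
widths = [2, 5, 1]
weights = [rng.normal(size=(widths[l + 1], widths[l])) for l in range(2)]
biases = [np.zeros(widths[l + 1]) for l in range(2)]
X, y = rng.normal(size=(8, 2)), rng.normal(size=8)
J, dWs, dbs = backprop_regression(X, y, weights, biases)
print(J, [dW.shape for dW in dWs])
```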

Initialisation

Convolutional Neural Networks

Convolutional Layers

Strided convolution allows for control over how many pixels the filter shifts by at each step; see the sketch below.
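As an illustration, a filter of size F slid with stride S over an N-pixel dimension gives an output of size ⌊(N − F)/S⌋ + 1 (without zero-padding). A single-channel sketch; the 6×6 input, 3×3 averaging filter, and stride of 2 are arbitrary assumptions:

```python
import numpy as np

def conv2d_strided(image, kernel, stride=2):
    """Valid (no padding) single-channel 2-D convolution with a given stride.

    As is customary in the CNN literature, the kernel is not flipped.
    """
    H, W = image.shape
    F = kernel.shape[0]                      # square filter assumed
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride : i * stride + F, j * stride : j * stride + F]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0               # simple averaging filter
print(conv2d_strided(image, kernel, stride=2).shape)   # (2, 2)
```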

Pooling Layers

Multiple Channels

Full CNN Architecture

Dropout