COMP4702 Lecture 7

Neural Networks

Figure 1 - Artificial neuron: (a) as a generalised linear regression model; (b) the output described as a sum of all terms.

Figure 2 - Plots of two popular activation functions in Machine Learning - ReLU (Rectified Linear Unit) and the logistic (sigmoid) function
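To make Figure 2 concrete, here is a minimal NumPy sketch of the two activation functions (the function names are my own):

```python
import numpy as np

def relu(z):
    # Rectified Linear Unit: max(0, z), applied element-wise
    return np.maximum(0.0, z)

def sigmoid(z):
    # Logistic (sigmoid) function: 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))
```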


Two-Layered Neural Network

$$q_k = h(W_{k1} x_1 + W_{k2} x_2 + \cdots + W_{kp} x_p + b_k), \qquad k=1, \dots, U \tag{6.4}$$

$$\hat{y}=W_1 q_1 + W_2 q_2 + \cdots + W_U q_U + b \tag{6.5}$$

$$\begin{align*} q_1 &= h(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + \cdots + W_{1p}^{(1)} x_p + b_1^{(1)}),\\ q_2 &= h(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + \cdots + W_{2p}^{(1)} x_p + b_2^{(1)}),\\ \vdots\\ q_U &= h(W_{U1}^{(1)} x_1 + W_{U2}^{(1)} x_2 + \cdots + W_{Up}^{(1)} x_p + b_U^{(1)}), \end{align*} \tag{6.6a}$$

$$\hat{y}=W_1^{(2)} q_1 + W_2^{(2)} q_2 + \cdots + W_U^{(2)} q_U + b^{(2)} \tag{6.6b}$$
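As a sketch of how (6.6a)-(6.6b) can be evaluated literally, one hidden unit at a time, assuming ReLU as the activation $h$ and NumPy arrays whose shapes follow (6.7); all names are illustrative:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def two_layer_forward(x, W1, b1, W2, b2, h=relu):
    # Eq. (6.6a): compute each hidden unit q_k in turn
    U = W1.shape[0]
    q = np.empty(U)
    for k in range(U):
        q[k] = h(np.dot(W1[k, :], x) + b1[k])
    # Eq. (6.6b): linear output layer
    return np.dot(W2, q) + b2
```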

Vectorisation over Units

$$\begin{align*} {\bf{W}}^{(1)} &= \begin{bmatrix} W_{11}^{(1)} & \cdots & W_{1p}^{(1)}\\ \vdots & & \vdots \\ W_{U1}^{(1)} & \cdots & W_{Up}^{(1)} \end{bmatrix}, \quad &&{\bf{b}}^{(1)}= \begin{bmatrix} b_1^{(1)}\\ \vdots\\ b_U^{(1)} \end{bmatrix}, \\ {\bf{W}}^{(2)} &= \begin{bmatrix} W_1^{(2)} & \cdots & W_U^{(2)} \end{bmatrix}, &&{\bf{b}}^{(2)} = \begin{bmatrix} b^{(2)} \end{bmatrix}, \end{align*} \tag{6.7}$$

$$\begin{align} {\bf{q}} &= h({\bf{W}}^{(1)} {\bf{x}} + {\bf{b}}^{(1)}), \tag{6.8a} \\ \hat{y} &= {\bf{W}}^{(2)} {\bf{q}} + {\bf{b}}^{(2)}, \tag{6.8b} \end{align}$$

$$\theta=\begin{bmatrix} \text{vec}({\bf{W}}^{(1)})^T & \text{vec}({\bf{b}}^{(1)})^T & \text{vec}({\bf{W}}^{(2)})^T & \text{vec}({\bf{b}}^{(2)})^T \end{bmatrix}$$
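The vectorised form (6.8a)-(6.8b) collapses the loop over units into two matrix-vector products; a minimal sketch under the same assumptions, including the flattening of all parameters into a single vector $\theta$:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def two_layer_forward_vec(x, W1, b1, W2, b2, h=relu):
    q = h(W1 @ x + b1)        # eq. (6.8a): all U hidden units at once
    return W2 @ q + b2        # eq. (6.8b)

def flatten_params(W1, b1, W2, b2):
    # Collect every parameter into one vector theta, mirroring the definition above
    return np.concatenate([W1.ravel(), b1.ravel(),
                           np.ravel(W2), np.atleast_1d(b2)])
```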

Deep Neural Networks

Figure 3 - An example of a deep neural network with L layers.


Neural Networks for Classification

Example 6.1 - MNIST Problem

Figure 4 - Samples from the MNIST dataset

$$p(y=m \mid {\bf{x}}), \qquad m=0, \dots, 9$$

$${\bf{W}}^{(1)} \in \mathbb{R}^{784 \times 10}, \qquad {\bf{b}}^{(1)} \in \mathbb{R}^{10}$$

$${\bf{W}}^{(1)} \in \mathbb{R}^{784 \times 200}, \quad {\bf{b}}^{(1)} \in \mathbb{R}^{200}, \quad {\bf{W}}^{(2)} \in \mathbb{R}^{200 \times 10}, \quad {\bf{b}}^{(2)} \in \mathbb{R}^{10}$$
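A quick sanity check of the number of parameters implied by these dimensions (arithmetic only; the counts do not depend on whether each weight matrix is written as $p \times U$ or $U \times p$):

```python
# Logistic regression model: one weight matrix (784 x 10) and one offset vector (10)
params_logreg = 784 * 10 + 10                         # 7,850 parameters

# One hidden layer with U = 200 units:
# W1 (784 x 200), b1 (200), W2 (200 x 10), b2 (10)
params_one_hidden = 784 * 200 + 200 + 200 * 10 + 10   # 159,010 parameters

print(params_logreg, params_one_hidden)
```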

Training a Neural Network

Backpropagation

$$d{\bf{b}}^{(l)} \triangleq \nabla_{{\bf{b}}^{(l)}} J(\theta)=\begin{bmatrix} \frac{\partial J(\theta)}{\partial b_1^{(l)}}\\ \vdots\\ \frac{\partial J(\theta)}{\partial b_{U^{(l)}}^{(l)}} \end{bmatrix}$$

$${\bf{W}}_{t+1}^{(l)} \leftarrow {\bf{W}}_t^{(l)} - \gamma \, d{\bf{W}}_t^{(l)}, \tag{6.23a}$$

$${\bf{b}}_{t+1}^{(l)} \leftarrow {\bf{b}}_t^{(l)} - \gamma \, d{\bf{b}}_t^{(l)}, \tag{6.23b}$$
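A sketch of the update (6.23a)-(6.23b), assuming the gradients `dW` and `db` have already been produced by backpropagation and that the parameters are stored as lists of per-layer NumPy arrays (names illustrative):

```python
def gradient_step(W, b, dW, db, gamma):
    # One gradient descent step, eq. (6.23a)-(6.23b), applied to every layer
    for l in range(len(W)):
        W[l] = W[l] - gamma * dW[l]
        b[l] = b[l] - gamma * db[l]
    return W, b
```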

Derivation of Backpropagation

Forward-Propagation

$${\bf{q}}^{(0)}={\bf{x}} \tag{6.24a}$$

$$\begin{cases} {\bf{z}}^{(l)} = {\bf W}^{(l)} {\bf q}^{(l-1)} + {\bf b}^{(l)}\\ {\bf q}^{(l)} = h({\bf{z}}^{(l)}) \end{cases} \quad \text{for } l=1, \dots, L-1 \tag{6.24b}$$

$${\bf{z}}^{(L)}={\bf{W}}^{(L)} {\bf{q}}^{(L-1)} + {\bf{b}}^{(L)} \tag{6.24c}$$

$$J(\theta) = \begin{cases} (y-z^{(L)})^2, &\text{if regression problem,}\\ -z_y^{(L)} + \ln\sum_{j=1}^M e^{z_j^{(L)}}, &\text{if classification problem,} \end{cases} \tag{6.24d}$$
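A minimal sketch of the forward pass (6.24a)-(6.24c) for a single input, with the regression cost of (6.24d); `W` and `b` are assumed to be lists holding the per-layer weight matrices and offset vectors:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, W, b, h=relu):
    # Eq. (6.24a)-(6.24c): propagate one input through all L layers
    L = len(W)
    q = x                              # q^(0) = x
    for l in range(L - 1):
        z = W[l] @ q + b[l]            # z^(l)
        q = h(z)                       # q^(l)
    return W[L - 1] @ q + b[L - 1]     # z^(L): no activation on the last layer

def regression_cost(y, z_L):
    # Eq. (6.24d), regression case: squared error
    return np.sum((y - z_L) ** 2)
```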

Back-Propagation

$$d{\bf q}^{(l)} \triangleq \nabla_{{\bf{q}}^{(l)}} J(\theta)= \begin{bmatrix} \frac{\partial J(\theta)}{\partial q_1^{(l)}}\\ \vdots\\ \frac{\partial J(\theta)}{\partial q_{U^{(l)}}^{(l)}} \end{bmatrix} \tag{6.25b}$$

$$d{\bf{z}}^{(l)} = d{\bf q}^{(l)} \odot h' ({\bf z}^{(l)}) \tag{6.27a}$$

$$d{\bf{q}}^{(l-1)} = {\bf W}^{(l)T} d{\bf z}^{(l)} \tag{6.27b}$$
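The backward recursion (6.27a)-(6.27b) for one hidden layer, as a sketch; `dq` is the incoming gradient $d{\bf q}^{(l)}$, `z` is ${\bf z}^{(l)}$, `Wl` is ${\bf W}^{(l)}$, and `h_prime` is the derivative of the activation function (all names are mine):

```python
def backward_step(dq, z, Wl, h_prime):
    # Eq. (6.27a): element-wise product with the activation derivative
    dz = dq * h_prime(z)
    # Eq. (6.27b): pass the gradient back to the previous layer's hidden units
    dq_prev = Wl.T @ dz
    return dz, dq_prev
```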

Figure 5 - Graphical representation of the backpropagation algorithm

Backpropagation - Algorithm

Input: current parameters ${\bf{W}}^{(l)}, {\bf{b}}^{(l)}$ for $l=1,\dots,L$, and a mini-batch of $n_b$ data points $({\bf{X}}, {\bf{y}})$

  1. Forward-Propagation
  2. Set ${\bf{Q}}^{(0)}\leftarrow {\bf{X}}$
  3. for $l=1,\dots,L$ do
  4.  |   ${\bf{Z}}^{(l)} = {\bf{Q}}^{(l-1)} {\bf{W}}^{(l)T} + {\bf{b}}^{(l)T}$
  5.  |   ${\bf{Q}}^{(l)} = h({\bf{Z}}^{(l)})$        Do not execute this line for the last layer $l=L$
  6. end
  7. Evaluate the cost function
  8. if Regression problem then
  9.  |   $J(\theta)=\frac{1}{n_b} \sum_{i=1}^{n_b} (y_i-Z_i^{(L)})^2$
  10.  |   $d{\bf{Z}}^{(L)}=-2({\bf{y}}-{\bf{Z}}^{(L)})$
  11. else if Classification problem then
  12.  |   $J(\theta) = \frac{1}{n_b} \sum_{i=1}^{n_b} \left(-Z_{i,y_i}^{(L)} + \ln \sum_{j=1}^{M} \exp (Z_{ij}^{(L)})\right)$
  13.  |   $dZ_{ij}^{(L)}=-\mathbb{I}\lbrace y_i=j\rbrace + \frac{\exp (Z_{ij}^{(L)})}{\sum_{k=1}^M \exp(Z_{ik}^{(L)})} \quad \forall i,j$
  14. Backward Propagation
  15. for $l=L,\dots,1$ do
  16.  |   $d{\bf Z}^{(l)}=d{\bf{Q}}^{(l)}\odot h' ({\bf{Z}}^{(l)})$       Do not execute this line for the last layer $l=L$
  17.  |   $d{\bf Q}^{(l-1)} = d{\bf{Z}}^{(l)}{\bf{W}}^{(l)}$
  18.  |   $d{\bf W}^{(l)} = \frac{1}{n_b} d{\bf{Z}}^{(l)T}{\bf{Q}}^{(l-1)}$
  19.  |   $db_j^{(l)} = \frac{1}{n_b} \sum_{i=1}^{n_b} dZ_{ij}^{(l)} \quad \forall j$
  20. end
  21. $\nabla_\theta J(\theta)=\begin{bmatrix}\text{vec}(d{\bf{W}}^{(1)})^T & d{\bf b}^{(1)T} & \cdots & \text{vec}(d{\bf W}^{(L)})^T & d{\bf b}^{(L)T}\end{bmatrix}$
  22. return $J(\theta), \nabla_\theta J(\theta)$
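The matrix-form algorithm above translates almost line for line into NumPy. The sketch below implements the classification branch with ReLU activations; the function and variable names are my own, and the log-sum-exp is computed without the usual max-subtraction, so it is not numerically robust:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z > 0).astype(z.dtype)

def backprop(X, y, W, b, h=relu, h_prime=relu_prime):
    """Cost and gradients on one mini-batch (classification case).

    X: (n_b, p) inputs, y: (n_b,) integer labels in 0..M-1
    W[l]: (U_l, U_{l-1}) weight matrix, b[l]: (U_l,) offset vector
    """
    n_b = X.shape[0]
    L = len(W)

    # Forward propagation (steps 2-6)
    Q = [X]                                    # Q^(0) = X
    Z = []
    for l in range(L):
        Zl = Q[-1] @ W[l].T + b[l]             # Z^(l) = Q^(l-1) W^(l)T + b^(l)T
        Z.append(Zl)
        Q.append(h(Zl) if l < L - 1 else Zl)   # no activation on the last layer

    # Cost function, classification case (step 12)
    ZL = Z[-1]
    logsumexp = np.log(np.sum(np.exp(ZL), axis=1))
    J = np.mean(-ZL[np.arange(n_b), y] + logsumexp)

    # Gradient at the output layer (step 13): softmax minus one-hot labels
    dZ = np.exp(ZL) / np.sum(np.exp(ZL), axis=1, keepdims=True)
    dZ[np.arange(n_b), y] -= 1.0

    # Backward propagation (steps 15-20)
    dW = [None] * L
    db = [None] * L
    for l in range(L - 1, -1, -1):
        if l < L - 1:
            dZ = dQ * h_prime(Z[l])            # dZ^(l) = dQ^(l) * h'(Z^(l))
        dQ = dZ @ W[l]                         # dQ^(l-1) = dZ^(l) W^(l)
        dW[l] = dZ.T @ Q[l] / n_b              # dW^(l) = (1/n_b) dZ^(l)T Q^(l-1)
        db[l] = dZ.sum(axis=0) / n_b           # db_j^(l) = (1/n_b) sum_i dZ_ij^(l)
    return J, dW, db
```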

Weight Initialisation

MNIST Classification

We see the effect of going from a logistic regression model (a single-layer network) to a model with one hidden layer; a sketch of the two parameter sets is given after Figure 8.

Figure 8 - The effect of adding a hidden layer in the MNIST classification example.
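As a sketch of how little the two models differ structurally, here are their parameter lists using the ${\bf{W}}^{(l)} \in \mathbb{R}^{U_l \times U_{l-1}}$ convention of (6.7); the random initial values are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Multi-class logistic regression: a single layer mapping 784 pixels to 10 logits
W_logreg = [0.01 * rng.standard_normal((10, 784))]
b_logreg = [np.zeros(10)]

# The same model with one hidden layer of 200 ReLU units inserted
W_hidden = [0.01 * rng.standard_normal((200, 784)),
            0.01 * rng.standard_normal((10, 200))]
b_hidden = [np.zeros(200), np.zeros(10)]

# Both can be trained with the backprop routine sketched earlier, e.g.
# J, dW, db = backprop(X_batch, y_batch, W_hidden, b_hidden)
```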

Convolutional Neural Networks

Figure 9 - The process of converting an image (e.g. an MNIST sample) into its matrix representation

$$q_{ij}=h\left(\sum_{k=1}^{F} \sum_{l=1}^{F} x_{i+k-1,\,j+l-1}\, W_{k,l}\right) \tag{6.29}$$

Figure 10 - An example of how the convolution of a matrix is computed. Each hidden unit's value depends only on a small region of pixels around its location
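A direct, unvectorised sketch of (6.29) for a single channel and a single $F \times F$ filter, with no zero-padding and a stride of one (names illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def conv_layer(X, W, h=relu):
    # Eq. (6.29): each hidden unit q_ij depends only on an F x F patch of the image
    F = W.shape[0]
    rows, cols = X.shape
    Q = np.empty((rows - F + 1, cols - F + 1))
    for i in range(Q.shape[0]):
        for j in range(Q.shape[1]):
            Q[i, j] = h(np.sum(X[i:i + F, j:j + F] * W))
    return Q
```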

Strides in Convolutional Neural Networks

Figure 11 - An example of how a convolutional layer is computed with a stride larger than one
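With a stride $s$, the only change is how far the window moves between neighbouring hidden units, which shrinks the output grid by roughly a factor of $s$ in each direction. A sketch extending the convolution above (the stride handling is my own, minimal version):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def strided_conv_layer(X, W, s, h=relu):
    # As in conv_layer, but the F x F window moves s pixels at a time
    F = W.shape[0]
    rows, cols = X.shape
    Q = np.empty(((rows - F) // s + 1, (cols - F) // s + 1))
    for i in range(Q.shape[0]):
        for j in range(Q.shape[1]):
            patch = X[i * s:i * s + F, j * s:j * s + F]
            Q[i, j] = h(np.sum(patch * W))
    return Q
```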

Pooling Layers in Convolutional Neural Networks

Figure 12 - An example of a pooling layer in a convolutional neural network
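A sketch of a max-pooling layer that condenses each non-overlapping block of hidden units into its maximum value; the 2x2 window size is an assumed, common choice:

```python
import numpy as np

def max_pool(Q, size=2):
    # Replace each non-overlapping size x size block with its maximum value
    rows, cols = Q.shape
    out = np.empty((rows // size, cols // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            block = Q[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = block.max()
    return out
```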

Multiple Convolutional Channels

Full CNN Architecture

Figure 13 - An example architecture for a convolutional neural network used for the classification of 6x6 grayscale images

Dropout