COMP4702 Lecture 7
Neural Networks
- A "neuron" in an (artificial) neural network is a highly simplified model of what goes on in a biological neuron
- An artificial neuron is a processing unit that takes in several inputs via weighted connections.
- The neuron computes a function of the sum of these weighted inputs and produces an output signal.

Figure 1 - Artificial Neuron (a) Generalised linear regression model (b) Output described as sum of all terms.
- Observe that the illustration on the left (Figure 1a) describes the output as the sum of all weighted input terms plus an offset.
- The generalised linear regression neuron (Figure 1a) is the simplest type of neuron that we can have.
- We can also have neurons that compute some other (non-linear) function of the weighted sum of the inputs.
- This function is known as the activation function (an analogy to the idea of a biological neuron firing and producing an electrical output if it has enough stimulus at its inputs).
- The two most common activation functions are the logistic sigmoid and the ReLU (both sketched below).

Figure 2 - Plots of two popular activation functions in Machine Learning - ReLU (Rectified Linear Unit) and the logistic (sigmoid) function
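As a concrete illustration, here is a minimal NumPy sketch of these two activation functions applied to a single artificial neuron's weighted sum of inputs (the inputs, weights and offset below are made-up values):

```python
import numpy as np

def relu(z):
    # ReLU: max(0, z), applied element-wise
    return np.maximum(0.0, z)

def sigmoid(z):
    # Logistic sigmoid: squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# A single neuron: weighted sum of inputs plus an offset, then an activation
x = np.array([0.5, -1.2, 3.0])      # example inputs (made up)
w = np.array([0.1, 0.4, -0.2])      # example weights (made up)
b = 0.05                            # offset / bias term

z = w @ x + b                       # weighted sum of inputs
print(relu(z), sigmoid(z))          # neuron output under each activation
```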
- A linear neuron and a neuron with logistic activation are mathematically identical to the linear regression and logistic regression models, respectively.
- A single layer of linear units can be trained to perform multiple linear regression (i.e. a regression model with multiple outputs).
- A single layer of sigmoidal units is close to the multi-class logistic regression model; using a softmax output instead of independent logistic sigmoids makes it identical.
Two-Layered Neural Network
- If we have a single-layer model, the notation q_k denotes the output of the k-th neuron (hidden unit).
- These outputs are connected to a second layer of neurons, which provides the final output of the model.
- The first layer is called the hidden layer.
- Every connection has a weight (parameter); for example, the notation w^(1)_{7,3} denotes the weight from input 3 to neuron 7 in layer 1.
- Our final model is then the output of the second layer: a weighted sum of the hidden units' outputs plus an offset.
- There is also a bias/offset signal/weight for the hidden layer.
Vectorisation over Units
- We observe that a two-layer (feed-forward) neural network (aka, multi-layer perceptron) can be written more compactly in terms of matrices and vectors.
- The parameters in each layer are stacked in a weight matrix W^(l) and an offset vector b^(l).
- Using this notation, the full two-layer model can then be written as: q = h(W^(1) x + b^(1)), z = W^(2) q + b^(2).
- The weight matrices and offset vectors are the parameters of the model, which can be compiled into a single parameter vector θ = {W^(1), b^(1), W^(2), b^(2)} (in the 2-layered perceptron shown above); a small sketch of this model is given below.
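A minimal NumPy sketch of the vectorised two-layer model q = h(W1 x + b1), z = W2 q + b2, assuming a sigmoid hidden activation and made-up layer sizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p, M, K = 4, 3, 2                   # input dim, hidden units, outputs (made up)
rng = np.random.default_rng(0)

W1, b1 = rng.normal(size=(M, p)), np.zeros(M)   # hidden-layer parameters
W2, b2 = rng.normal(size=(K, M)), np.zeros(K)   # output-layer parameters

x = rng.normal(size=p)              # one input vector
q = sigmoid(W1 @ x + b1)            # hidden-layer outputs
z = W2 @ q + b2                     # model output (linear output units)
print(z)
```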
Deep Neural Networks
- We can extend the model to any number of layers, creating a deep neural network
- The notation l = 1, ..., L indexes the layers, and the number of units in the l-th layer is denoted U_l.
- Each unit is fully connected to the inputs/units/outputs of the previous and next layers.

Figure 3 - An example of a deep neural network with L layers.
- Observe that each layer l of the neural network is parameterised by its own weight matrix W^(l) and bias vector b^(l).
- In the early days of feed-forward neural networks, these models were seen as examples of parallel distributed processing.
- If we think about an input signal propagating from the inputs through the layers, computation across units in any given layer is independent (and thus could be parallelised).
- If we want to compute the output for a whole set of data points, computation required for each data point is independent.
- This motivation to represent the computation with vectors and matrices is directly related to why GPUs are so useful for deep learning; a batched sketch is given below.
Neural Networks for Classification
- For the regression case, we just have linear activation on all the outputs.
- For classification, we use the same building blocks.
- For binary classification, a single output unit with logistic sigmoid can do the job
- For K classes, we have K output units and use the softmax function, ensuring that the outputs are non-negative and sum to one, where the m-th output value estimates the class probability p(y = m | x).
- Example 6.1 demonstrates classification using neural networks on the MNIST dataset.
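A minimal sketch of the softmax output, showing that the K outputs are non-negative and sum to one (the raw output values below are made up):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged mathematically
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])   # raw outputs z of the last layer (made up)
probs = softmax(logits)               # estimated class probabilities
print(probs, probs.sum())             # all >= 0 and they sum to 1.0
```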
Example 6.1 - MNIST Problem

Figure 4 - Samples from the MNIST datasets
- Dataset containing 28x28 pixel grayscale images of handwritten digits.
- Contains 60,000 training and 10,000 validation points.
- To feed an image into a multi-layer perceptron, we flatten the 28x28 = 784 pixels into one long vector.
- Each item in the vector represents the intensity of one pixel.
- After normalisation, the intensity values lie in the interval [0, 1].
- The model will have to predict between 10 classes, representing each of the 10 digits.
- Based on a set of training data with images and labels, the problem is to find a good model for the class probabilities.
- That is, the probability that some unseen image belongs to each of the 10 classes.
- Assume that we would like to use logistic regression to solve this problem with a softmax output.
- This is identical to a neural network with just one layer.
- The parameters of this model would then be a 10 x 784 weight matrix W and a 10-dimensional offset vector b.
- This gives a total of 10 x 784 + 10 = 7850 parameters.
- Extending this to a two-layer neural network with M hidden units requires two sets of weight matrices and offset vectors: W^(1) (M x 784), b^(1) (M x 1), W^(2) (10 x M) and b^(2) (10 x 1).
- This produces a model with 795M + 10 parameters; a quick check of these counts is sketched below.
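A quick sanity check of the parameter counts described above (the hidden-layer size M below is just a placeholder value for illustration):

```python
# Softmax (one-layer) model on MNIST: 10 x 784 weight matrix plus 10 offsets
p, K = 28 * 28, 10
one_layer = K * p + K
print(one_layer)                 # 7850

# Two-layer model with M hidden units: W1 is M x 784, b1 is M, W2 is 10 x M, b2 is 10
M = 200                          # placeholder value for illustration
two_layer = (M * p + M) + (K * M + K)
print(two_layer)                 # 795*M + 10 = 159010 for M = 200
```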
Training a Neural Network
- Training is an optimisation problem over the parameters of the model.
- If we want to refer to all of the model parameters, we collect all of the weights (and bias weights) and arrange them in a big vector θ.
- The cost function is a sum over the losses for each data point in the training set.
- Gradient descent is the basis of the most commonly used training algorithms
- How can we get the gradient for a multi-layer neural network with non-linear activation functions?
Backpropagation
- Lindholm uses "backpropagation" to refer to computing the gradient
- To get the gradient for MLPs, the chain rule of calculus is critical.
- The mathematical expressions in the book can be interpreted as follows:
- We need the partial derivative of the cost function J(θ) with respect to each weight (and bias weight), in each layer of the network (Eq. 6.22).
- Backpropagation gives us these gradients with respect to the weights and offsets, dJ/dW^(l) and dJ/db^(l), for every layer.
- We use these to do our update for gradient descent (Eq. 6.23), θ ← θ - γ ∇J(θ); a sketch of this update is given below.
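A sketch of the resulting gradient-descent update: every parameter moves a small step in the negative gradient direction. The gradient here is a stand-in for a toy cost function; in practice it is computed by backpropagation.

```python
import numpy as np

def gradient_descent_step(theta, grad, learning_rate=0.1):
    # theta and grad are arrays of the same shape (all weights stacked in one vector)
    return theta - learning_rate * grad

# Toy example: minimise J(theta) = ||theta||^2, whose gradient is 2*theta
theta = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    theta = gradient_descent_step(theta, 2.0 * theta)
print(theta)    # close to the minimiser [0, 0, 0]
```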
Derivation of Backpropagation
- Computing the gradient consists of two steps: forward propagation and backward propagation.
Forward-Propagation
- In the forward propagation step, the cost function is evaluated using the neural network model.
- We start with the input x, which is the input to the first layer: q^(0) = x.
- We repeatedly apply the weights and bias of each layer to the previous layer's output: z^(l) = W^(l) q^(l-1) + b^(l) is the weighted sum of inputs before the application of the activation function, and q^(l) = h(z^(l)) is the output of the layer's units.
- We then evaluate the last layer, z^(L) = W^(L) q^(L-1) + b^(L), with no activation applied.
- Given the output of the model, denoted z^(L), we compute the cost function to evaluate the performance of the model.
- Observe that the cost function uses the squared difference (squared-error loss) for regression and cross-entropy for classification; both are sketched below.
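A sketch of the two cost functions evaluated at the end of the forward pass for a single data point; the model outputs and targets below are made up, and the ½ factor in the squared error is just a common convention.

```python
import numpy as np

def squared_error(z, y):
    # Regression cost for one data point: squared difference between output and target
    return 0.5 * np.sum((z - y) ** 2)

def cross_entropy(z, y_onehot):
    # Classification cost: softmax over the outputs, then negative log-likelihood
    p = np.exp(z - np.max(z))
    p /= p.sum()
    return -np.sum(y_onehot * np.log(p))

print(squared_error(np.array([1.2]), np.array([1.0])))          # 0.02
print(cross_entropy(np.array([2.0, 0.1, -1.0]),                 # 3-class example
                    np.array([1.0, 0.0, 0.0])))
```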
Back-Propagation
- The gradients with respect to the (total weighted sum of) input signals to the units in a layer, dJ/dz^(l), and with respect to the output signals of the units in a layer, dJ/dq^(l), are given in Eq. 6.25.
- Then, we start calculating the gradients at the output layer, using the derivative of the cost function with respect to the last layer's output z^(L), in Eq. 6.26.
- Eq. 6.26a describes that for regression problems with squared-error loss, the derivative at the end of our MLP is dJ/dz^(L) = z^(L) - y (up to the constant from the ½ convention).
- Eq. 6.26b describes the derivative for a multi-class classification problem with cross-entropy loss: dJ/dz^(L) = softmax(z^(L)) - y, where y is the one-hot encoded label.
- The gradients for the weights and bias weights in that layer can then be computed, and the gradient signals in the current layer are used to compute the gradients for the previous layer (Eq. 6.27):
  - dJ/dW^(l) = dJ/dz^(l) (q^(l-1))^T and dJ/db^(l) = dJ/dz^(l),
  - dJ/dq^(l-1) = (W^(l))^T dJ/dz^(l) and dJ/dz^(l-1) = dJ/dq^(l-1) ⊙ h'(z^(l-1)).
- Note here that ⊙ denotes the element-wise product and h' is the derivative of the activation function h, which acts element-wise on z^(l-1); a one-layer sketch of this backward step is given below.
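A minimal NumPy sketch of one backward step through a layer, following the Eq. 6.27-style recursion above: given dJ/dz^(l), it returns the gradients for that layer's parameters and the gradient dJ/dz^(l-1) to pass further back. It assumes a sigmoid activation in layer l-1 and works on 1-D arrays for a single data point.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backward_one_layer(dJ_dz, W, q_prev, z_prev):
    """One step of the backward recursion for a single data point (1-D arrays)."""
    dJ_dW = np.outer(dJ_dz, q_prev)            # dJ/dW^(l) = dJ/dz^(l) (q^(l-1))^T
    dJ_db = dJ_dz                              # dJ/db^(l) = dJ/dz^(l)
    dJ_dq_prev = W.T @ dJ_dz                   # dJ/dq^(l-1) = (W^(l))^T dJ/dz^(l)
    h_prime = sigmoid(z_prev) * (1 - sigmoid(z_prev))   # derivative of the activation
    dJ_dz_prev = dJ_dq_prev * h_prime          # element-wise product with h'(z^(l-1))
    return dJ_dW, dJ_db, dJ_dz_prev
```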

Figure 5 - Graphical representation of the backpropagation algorithm
- Observe that in each of the forward propagation steps, the weights, weighted sums z^(l) and neuron outputs q^(l) are recorded so they can be used later in the backward pass.
- Furthermore, in the backpropagation step, we need the gradient with respect to each weight matrix and bias vector.
- All equations used for backpropagation can be derived from the chain rule of calculus.
- So far, we have only considered backpropagation for one data point.
- However, we do want to compute the cost function J and its gradients dJ/dW^(l) and dJ/db^(l) for the whole mini-batch.
- Therefore, we run Eq. 6.24, 6.26 and 6.27 for all data points in the current mini-batch and average their results for J, dJ/dW^(l) and dJ/db^(l).
- To do this in a computationally efficient manner, we process all n_b data points in the mini-batch simultaneously by stacking them in matrices where each row represents one data point.
Backpropagation - Algorithm
Input:
- Parameters W^(1), b^(1), ..., W^(L), b^(L)
- Activation function h
- Data X, Y with n_b rows, where each row corresponds to one data point in the current mini-batch

Result: J and the gradients dJ/dW^(l), dJ/db^(l) for l = 1, ..., L, of the current mini-batch

- Forward propagation
- Set Q^(0) = X
- for l = 1, ..., L do
- | Z^(l) = Q^(l-1) (W^(l))^T + (b^(l))^T (the offset vector is added to every row)
- | Q^(l) = h(Z^(l))  (do not execute this line for the last layer)
- end
- Evaluate the cost function
- if Regression problem then
- | J = (1 / (2 n_b)) * sum of squared differences between Z^(L) and Y
- | dJ/dZ^(L) = (Z^(L) - Y) / n_b
- else if Classification problem then
- | J = (1 / n_b) * cross-entropy between the softmax of each row of Z^(L) and the one-hot labels Y
- | dJ/dZ^(L) = (softmax(Z^(L)) - Y) / n_b
- Backward propagation
- for l = L, L-1, ..., 1 do
- | dJ/dZ^(l) = dJ/dQ^(l) ⊙ h'(Z^(l))  (do not execute this line for the last layer)
- | dJ/dW^(l) = (dJ/dZ^(l))^T Q^(l-1)
- | dJ/db^(l) = column-wise sum of dJ/dZ^(l)
- | dJ/dQ^(l-1) = dJ/dZ^(l) W^(l)
- end
- return J, dJ/dW^(l), dJ/db^(l) for l = 1, ..., L
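A compact NumPy sketch of the algorithm above, assuming a regression network with ReLU activations in the hidden layers; the function name, the ½ squared-error convention and the 1/n_b averaging are illustrative choices rather than quotations from the book.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def backprop(Ws, bs, X, Y):
    """Cost and gradients for a mini-batch (rows of X, Y), squared-error cost."""
    L, n_b = len(Ws), X.shape[0]

    # Forward propagation: record Z and Q for every layer
    Q, Z = [X], []
    for l in range(L):
        Z.append(Q[-1] @ Ws[l].T + bs[l])
        Q.append(relu(Z[-1]) if l < L - 1 else Z[-1])   # no activation on the last layer

    J = 0.5 * np.sum((Q[-1] - Y) ** 2) / n_b
    dZ = (Q[-1] - Y) / n_b                              # dJ/dZ^(L) for squared error

    # Backward propagation
    dWs, dbs = [None] * L, [None] * L
    for l in reversed(range(L)):
        dWs[l] = dZ.T @ Q[l]                            # dJ/dW^(l)
        dbs[l] = dZ.sum(axis=0)                         # dJ/db^(l)
        if l > 0:
            dQ = dZ @ Ws[l]                             # dJ/dQ^(l-1)
            dZ = dQ * (Z[l - 1] > 0)                    # element-wise with ReLU'(Z^(l-1))
    return J, dWs, dbs

# Tiny usage example with made-up shapes
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 4)) * 0.1, rng.normal(size=(1, 3)) * 0.1]
bs = [np.zeros(3), np.zeros(1)]
X, Y = rng.normal(size=(5, 4)), rng.normal(size=(5, 1))
print(backprop(Ws, bs, X, Y)[0])
```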
Weight Initialisation
- Weights are typically initialised to small, random values at the start of gradient-descent based training
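A sketch of a typical initialisation, assuming small zero-mean Gaussian values; the 0.01 scale is a common but arbitrary choice (schemes that scale with the layer size are also widely used).

```python
import numpy as np

def init_layer(n_out, n_in, scale=0.01, seed=None):
    # Small random weights break the symmetry between units; offsets can start at zero
    rng = np.random.default_rng(seed)
    W = scale * rng.standard_normal((n_out, n_in))
    b = np.zeros(n_out)
    return W, b

W1, b1 = init_layer(200, 784)   # e.g. a hidden layer for MNIST-sized inputs
```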
MNIST Classification
We see the effect of a single-layered logistic regression model vs one with a single hidden layer.
- Use stochastic gradient descent with a fixed learning rate and mini-batches of n_b data points.
- Since there are 60,000 training points in total, one epoch is completed after 60,000 / n_b iterations.
- Run the algorithms for 15 epochs.


Figure 8 - Effect of adding an additional layer to the MNIST classification example.
- In the original model, the misclassification rate is approximately 8% on validation data.
- Adding a new hidden layer (with ReLU activation function) decreases the misclassification rate to about 2%.
Convolutional Neural Networks
- CNNs are a big part of the reason why deep learning has been so prominent in the recent surge of interest in AI/ML.
- The fundamental idea is to take into account structure in the data, specifically the relationships between different inputs (e.g. neighbouring pixels).
- CNNs were originally applied to image data, but have been applied to other types of data with spatial and/or temporal structure

Figure 9 - The process of converting an image (e.g. sample like MNIST) into matrix representation
- When converting an image into its data representation, each number represents the intensity of a pixel.
- The pixel values are stored in a matrix with elements denoted x_{j,k}.
- If we have 36 hidden units (a 6x6 grid) for this 6x6 input, a fully connected (dense) layer would require 36 x 36 = 1296 weights (plus biases).
- Convolutional layers have far fewer parameters and do so by enforcing some useful structure:
- Sparse Interactions: Lining up the input matrix with the hidden-unit matrix, only a limited local region of inputs is connected to each hidden unit; the borders are typically handled with zero-padding.
- Parameter Sharing: Rather than having separate weights for each hidden unit, force all hidden units to use the same weight values, called a "filter".
- The same filter weights are reused (and learned) at every position that the filter can be placed in.
- Can have more than one filter in a layer, with each filter having its own weight parameters
- The convolution operation is mathematically represented by an equation of the form q_{i,j} = Σ_m Σ_n w_{m,n} x_{i+m, j+n}: the hidden unit at position (i, j) is a weighted sum of the zero-padded input values in the filter window around that position.
- In this equation, x denotes the zero-padded input to the layer, q the output of the layer, and w the filter, whose rows and columns are indexed by m and n; a NumPy sketch is given below.

Figure 9 - An example of how the convolution of a matrix is computed. Each hidden unit's value is only dependent on a small region of pixels around its location
- This can be thought of as the same as the fully-connected/dense network where all of the weights outside the local neighbourhood are set to 0 (and the remaining weights are shared).
- The properties of Sparse Interactions and Parameter Sharing make the CNN relatively invariant to translations of objects in images
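A minimal NumPy sketch of a convolutional layer's core computation (technically a cross-correlation, as is conventional in deep learning) with zero-padding and a single filter; the indexing convention may differ slightly from the book's equation, and the example image and filter are made up.

```python
import numpy as np

def conv2d_same(x, w, b=0.0):
    """Zero-padded 'same' convolution of a 2-D input x with a small square filter w."""
    f = w.shape[0]                        # assume an f x f filter with odd f
    pad = f // 2
    xp = np.pad(x, pad)                   # zero-padding so the output keeps x's shape
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            region = xp[i:i + f, j:j + f]       # local neighbourhood of pixel (i, j)
            out[i, j] = np.sum(region * w) + b  # shared filter weights, one offset
    return out

image = np.arange(36, dtype=float).reshape(6, 6)   # a made-up 6x6 "image"
filt = np.array([[0., 1., 0.],
                 [1., -4., 1.],
                 [0., 1., 0.]])                    # e.g. an edge-detecting filter
print(conv2d_same(image, filt).shape)              # (6, 6): one hidden unit per pixel
```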
Strides in Convolutional Neural Networks
- Another technique to reduce computation is to move the filter (input window) by more than one pixel at a time when computing the output as shown in the figure below.
- We can think of the hidden units as the activation values obtained by moving the filter around the image.
- The number of hidden units is determined by the filter size and stride (the standard formula is sketched after the figure below).
- The edges can get messy depending on what sizes are chosen.

Figure 10 - An example of how the convolution is computed when the filter is moved with a stride larger than one, producing fewer hidden units.
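With a stride s, filter size f and zero-padding p on an n x n input, the number of hidden units along each dimension follows the standard CNN formula floor((n + 2p - f) / s) + 1; a tiny sketch (the formula is standard, not quoted from the book):

```python
def conv_output_size(n, f, stride=1, pad=0):
    # Number of valid filter positions along one dimension
    return (n + 2 * pad - f) // stride + 1

print(conv_output_size(6, 3, stride=1, pad=1))   # 6 -> stride 1 keeps the 6x6 grid
print(conv_output_size(6, 3, stride=2, pad=1))   # 3 -> stride 2 roughly halves it
```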
Pooling Layer in Convolutional Neural Network
- A layer that has no trainable parameters, but is used as a component to reduce the size of the representation (and hence the model).
- Done by computing average / maximum of values over a given filter size.
- In the figure, each output value is obtained by applying this operation (e.g. taking the maximum) over one position of the pooling filter, and the filter is then moved to the next position.

Figure 10 - An example of a pooling layer, in which each output value is the maximum (or average) of the values inside the pooling filter region.
- Pooling can make the model more invariant to (small) translations of the input
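A sketch of 2x2 max pooling with stride 2 in NumPy; note that it has no trainable parameters, it simply downsamples each channel (the input values are made up).

```python
import numpy as np

def max_pool_2x2(x):
    """Max over non-overlapping 2x2 blocks (assumes even height and width)."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)      # split into 2x2 tiles
    return blocks.max(axis=(1, 3))                # one maximum per tile

x = np.array([[1., 2., 0., 1.],
              [3., 4., 1., 0.],
              [0., 1., 5., 6.],
              [2., 1., 7., 8.]])
print(max_pool_2x2(x))    # [[4. 1.]
                          #  [2. 8.]]
```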
Multiple Convolutional Channels
- With a single filter, we produce a matrix of "hidden units".
- If we use multiple filters, we get a stack of these matrices; the individual matrices are called channels.
- The hidden layer of units is then a (three-dimensional) tensor, as sketched below.
- When a CNN is trained on colour images, it is typical to have one (or more) separate channels for the Red, Green and Blue intensity values of each pixel in the image.
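A small sketch of the resulting shape: with several filters, the hidden layer becomes a stack of matrices, i.e. a three-dimensional tensor (the sizes and values below are made up placeholders).

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, num_filters = 6, 6, 4
# One 6x6 matrix of hidden units per filter (values are placeholders here)
channels = [rng.standard_normal((H, W)) for _ in range(num_filters)]
hidden = np.stack(channels, axis=-1)
print(hidden.shape)   # (6, 6, 4): rows x columns x channels
```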
Full CNN Architecture

Figure 11 - An example architecture for a convolutional neural network used for the classification of 6x6 grayscale images
- The typical CNN architecture is shown above.
- Typically there is a series of convolutional (and pooling) layers that decrease the spatial dimensionality of the image while extracting features.
- Then there are some dense layers (an MLP) that take the convolutional outputs and classify them.
- In the case of classification, the last layer has K outputs, one for each class.
- The target output is encoded in the one-hot fashion, as sketched below.
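A small sketch of one-hot encoding the class labels to match the K output units (here K = 10 for the digits):

```python
import numpy as np

def one_hot(labels, num_classes=10):
    # Row i is all zeros except for a 1 in the column of data point i's class
    out = np.zeros((len(labels), num_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

print(one_hot(np.array([3, 0, 9])))   # three labels -> a 3 x 10 matrix of 0s and 1s
```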
Dropout
- When training, randomly set some hidden units to 0 so that they don't do anything.
- Compute the gradient for the remaining sub-network
- On the next training iteration, choose a different random set of hidden units to drop out.
- This can reduce model complexity.
- Finally, at the end of training, all of the hidden units are restored and used together; a sketch of the training-time masking is given below.
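A sketch of dropout applied to one layer's hidden activations during training: each unit is zeroed independently with probability p. The rescaling of the surviving activations is the "inverted dropout" convention, an assumption here; an equivalent alternative is to scale the weights once training has finished.

```python
import numpy as np

def dropout(q, drop_prob=0.5, rng=None):
    """Randomly zero hidden units during training (inverted-dropout convention)."""
    rng = rng or np.random.default_rng()
    mask = rng.random(q.shape) >= drop_prob   # keep each unit with probability 1 - drop_prob
    return q * mask / (1.0 - drop_prob)       # rescale so the expected value is unchanged

q = np.ones(8)                     # made-up hidden activations
print(dropout(q, drop_prob=0.5))   # roughly half the entries zeroed, the rest scaled to 2.0
```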