COMP4702 Lecture 4

Error Function


$$E_{\text{new}} \overset{\Delta}{=} \mathbb{E}_{\star}\left[E\big(\hat{y}(\mathbf{x}_{\star}; \mathcal{T}), y_{\star}\big)\right]\tag{4.2}$$
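Here $E(\hat{y}, y)$ is the chosen error function. Two standard choices, stated here for concreteness (they are the usual ones, not spelled out above), are misclassification error for classification and squared error for regression:

$$E(\hat{y}, y) = \mathbb{I}\{\hat{y} \neq y\} \qquad \text{and} \qquad E(\hat{y}, y) = (\hat{y} - y)^2.$$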

Motivation for Estimating E_new


Why E_train isn't E_new

Estimation via Validation Set

Figure 1 - Partition of dataset into training and hold-out set.
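A minimal sketch of the hold-out estimate, assuming a k-NN classifier and synthetic placeholder data (neither the dataset nor the model choice comes from the lecture):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data: X is an (n, p) feature matrix, y the class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Split the available data T into a training set and a hold-out validation set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Train on the training portion only.
model = KNeighborsClassifier(n_neighbors=20).fit(X_train, y_train)

# Hold-out misclassification rate, used as an estimate of E_new.
E_holdout = np.mean(model.predict(X_val) != y_val)
print(f"Estimated E_new (hold-out): {E_holdout:.3f}")
```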

Estimation via k-Fold Cross Validation

Figure 2 - k-fold cross-validation
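A sketch of k-fold cross-validation with the folds written out explicitly (same placeholder k-NN model as above); scikit-learn's `cross_val_score` gives an equivalent estimate in a single call:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def kfold_error(X, y, k=10, n_neighbors=20):
    """Estimate E_new by averaging the hold-out error over k folds."""
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    fold_errors = []
    for train_idx, val_idx in kf.split(X):
        model = KNeighborsClassifier(n_neighbors=n_neighbors)
        model.fit(X[train_idx], y[train_idx])
        # Misclassification rate on the fold that was held out of training.
        fold_errors.append(np.mean(model.predict(X[val_idx]) != y[val_idx]))
    # The k-fold estimate is the average of the k hold-out errors.
    return float(np.mean(fold_errors))
```

Every data point is used for validation exactly once, at the cost of training the model k times.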

Training Error - Generalisation Gap Decomposition of E_new

$$\overline{E}_\text{new}\overset{\Delta}{=}\mathbb{E}_{\mathcal{T}}\left[E_\text{new}(\mathcal{T})\right]\tag{4.8a}$$

$$\overline{E}_\text{train}\overset{\Delta}{=}\mathbb{E}_{\mathcal{T}}\left[E_\text{train}(\mathcal{T})\right]\tag{4.8b}$$

$$\text{generalisation gap}\overset{\Delta}{=}\overline{E}_\text{new}-\overline{E}_\text{train}\tag{4.10}$$

Figure 3 - Behaviour of $E_\text{new}$ and $E_\text{train}$ for many supervised machine learning techniques as a function of model complexity.
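To reproduce the qualitative behaviour in Figure 3, one can sweep a complexity knob and record both errors; the sketch below (placeholder data, k-NN with $1/k$ as the complexity measure) typically shows the gap widening as the model becomes more flexible:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same placeholder data-generating process as in the hold-out sketch above.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Smaller k means a more flexible (higher-complexity) k-NN model.
for k in [70, 20, 5, 2, 1]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    E_train = np.mean(model.predict(X_train) != y_train)
    E_test = np.mean(model.predict(X_test) != y_test)  # estimate of E_new
    print(f"k={k:>2}  E_train={E_train:.3f}  E_new~{E_test:.3f}  gap~{E_test - E_train:.3f}")
```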

Training Error - Generalisation Gap Example

Figure 4 - Optimal decision boundary for a classification problem.


|  | k-NN with k=70 | k-NN with k=20 | k-NN with k=2 |
|---|---|---|---|
| $\overline{E}_\text{train}$ | 0.24 | 0.22 | 0.17 |
| $\overline{E}_\text{new}$ | 0.25 | 0.23 | 0.30 |
| generalisation gap | 0.01 | 0.01 | 0.13 |

Minimising Training Gap

Figure 5 - Optimal decision boundary for a classification problem.

Example: Training Error vs Generalisation Gap

Figure 6 - Generalisation Gap of Various Models

Bias-Variance Decomposition of E_new

If we knew $z_0$ in reality, all of this would be redundant. However, in practice we do not know $z_0$ and therefore we must predict it.
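For squared-error loss, the decomposition takes the standard textbook form (the notation $\sigma^2$ for the irreducible noise variance is an assumption here, not taken from the notes above):

$$\overline{E}_\text{new} = \underbrace{\text{bias}^2}_{\text{model too rigid}} + \underbrace{\text{variance}}_{\text{model too sensitive to }\mathcal{T}} + \underbrace{\sigma^2}_{\text{irreducible error}}$$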


Figure 7 - Decision boundaries for Decision Trees with different depths
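A sketch of the kind of experiment behind Figure 7, assuming scikit-learn decision trees with depth as the complexity knob (placeholder data again): shallow trees give rigid, high-bias boundaries, while deep trees track the training data closely and vary strongly between training sets (high variance).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Placeholder data-generating process, as in the earlier sketches.
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for depth in [1, 3, 10, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    E_train = np.mean(tree.predict(X_train) != y_train)
    E_test = np.mean(tree.predict(X_test) != y_test)
    print(f"max_depth={depth}  E_train={E_train:.3f}  E_new~{E_test:.3f}")
```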


Figure 8 - Relationship between Bias, Variance and the size of the training set, n

Figure 9 - Effect of regularisation on error in polynomial regression model.
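The setting in Figure 9 can be sketched with ridge-regularised polynomial regression; the degree, penalty values and data below are placeholders, not the lecture's actual experiment:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# 1-D placeholder regression data with additive noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=200)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)

# Larger alpha = stronger regularisation = lower variance but higher bias.
for alpha in [1e-6, 0.01, 1.0, 100.0]:
    model = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=alpha))
    model.fit(x_tr, y_tr)
    mse_train = np.mean((model.predict(x_tr) - y_tr) ** 2)
    mse_new = np.mean((model.predict(x_te) - y_te) ** 2)
    print(f"alpha={alpha:<8}  E_train={mse_train:.3f}  E_new~{mse_new:.3f}")
```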

Figure 10 - ROC (left) and Precision-Recall (right) curves are two types of curves used to evaluate the performance of a classifier at different decision cut-offs.
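A sketch of how both curves can be produced by sweeping the decision threshold of a single probabilistic classifier (logistic regression and the data below are placeholder choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, precision_recall_curve
from sklearn.model_selection import train_test_split

# Placeholder binary classification data.
rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 2))
y = (X[:, 0] + rng.normal(scale=1.0, size=2000) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # predicted P(y = 1 | x)

# Each curve is traced out by varying the cut-off applied to `scores`.
fpr, tpr, _ = roc_curve(y_test, scores)                        # ROC: TPR vs FPR
precision, recall, _ = precision_recall_curve(y_test, scores)  # precision vs recall
```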

Example 4.5

|  | y = normal | y = abnormal |
|---|---|---|
| $\hat{y}(\mathbf{x})=\text{normal}$ | 3177 | 237 |
| $\hat{y}(\mathbf{x})=\text{abnormal}$ | 1 | 13 |

|  | y = normal | y = abnormal |
|---|---|---|
| $\hat{y}(\mathbf{x})=\text{normal}$ | 3067 | 165 |
| $\hat{y}(\mathbf{x})=\text{abnormal}$ | 111 | 85 |
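Taking "abnormal" as the positive class (an assumed convention, not stated above), and noting that both matrices have the same column totals (3178 normal, 250 abnormal, i.e. apparently the same test data at two different cut-offs):

$$\text{first matrix:}\quad \text{recall}=\tfrac{13}{13+237}\approx 0.05,\qquad \text{precision}=\tfrac{13}{13+1}\approx 0.93$$

$$\text{second matrix:}\quad \text{recall}=\tfrac{85}{85+165}=0.34,\qquad \text{precision}=\tfrac{85}{85+111}\approx 0.43$$

Lowering the cut-off for predicting "abnormal" catches far more of the abnormal cases (higher recall) at the price of many more false alarms (lower precision), which is the trade-off summarised by the curves in Figure 10.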