Lindholm et al, Chapter 4

A summary of Lindholm Chapter 4 - Understanding, Evaluating and Improving Performance

Expected New Data Error E_new: Performance in Production

This section is also, in effect, about model selection.

$$\text{Misclassification}:\ E(\hat{y}, y)\triangleq \mathbb{I}(\hat{y}\ne y)=\begin{cases} 0&\text{if } \hat{y}=y\\1&\text{if }\hat{y}\ne y\end{cases}\tag{4.1a}$$

Supervised Learning with Inputs Drawn from a Random Distribution

Integration-Based Error Rate

$$E_{\text{new}} \overset{\Delta}{=} \mathbb{E}_\star [ E ( \hat{y}({\bf{x_\star}}; \mathcal{T}), y_\star)]\tag{4.2}$$
$$\mathbb{E}_\star [ E ( \hat{y}({\bf{x_\star}}; \mathcal{T}), y_\star)]=\int E(\hat{y}({\bf{x_\star}};\mathcal{T}), y_\star)\, p({\bf{x_\star}},y_\star)\, d{\bf{x_\star}}\, dy_\star \tag{4.3}$$

$$E_\text{train} \overset{\Delta}{=} \frac{1}{n}\sum_{i=1}^n E(\hat{y}({\bf{x}}_i;\mathcal{T}), y_i)\tag{4.4}$$
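As a minimal sketch of (4.1a) and (4.4) in code, assuming a fitted classifier `model` with a scikit-learn-style `predict` method (the names here are placeholders, not from the book):

```python
import numpy as np

def misclassification(y_hat, y):
    """Misclassification loss (4.1a): 0 where the prediction equals the label, 1 otherwise."""
    return (np.asarray(y_hat) != np.asarray(y)).astype(float)

def training_error(model, X_train, y_train):
    """Training error (4.4): average loss over the same data the model was trained on."""
    y_hat = model.predict(X_train)   # `model` assumed to expose a scikit-learn-style predict()
    return float(np.mean(misclassification(y_hat, y_train)))
```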

Estimating E_new

Cannot Estimate E_new from Training Data

Hold-Out Validation Data

$$E_\text{hold-out} \overset{\Delta}{=} \frac{1}{n_v} \sum_{j=1}^{n_v} E(\hat{y}({\bf{x}}_j';\mathcal{T}), y_j')\tag{4.6}$$
$$E_\text{hold-out} \approx E_\text{new}$$
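A sketch of the hold-out estimate (4.6), here built on scikit-learn's `train_test_split` and a generic estimator with `fit`/`predict`; the split fraction and the k-NN example are arbitrary choices, not the book's:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def hold_out_error(X, y, make_model, val_fraction=0.25, seed=0):
    """Estimate E_new as in (4.6): train on one part of the data, score on the held-out part."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=val_fraction, random_state=seed)
    model = make_model().fit(X_tr, y_tr)             # trained only on the training split
    return float(np.mean(model.predict(X_val) != y_val))

# e.g. hold_out_error(X, y, lambda: KNeighborsClassifier(n_neighbors=20))
```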

k-Fold Cross-Validation
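A sketch of k-fold cross-validation using scikit-learn's `KFold` (an implementation choice): split the data into k folds, let each fold serve once as validation data, and average the k hold-out errors. It assumes `X` and `y` are NumPy arrays and `make_model` returns a fresh estimator:

```python
import numpy as np
from sklearn.model_selection import KFold

def k_fold_error(X, y, make_model, k=10, seed=0):
    """Average hold-out error over k folds; every data point is used for validation exactly once."""
    errors = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        model = make_model().fit(X[train_idx], y[train_idx])
        errors.append(np.mean(model.predict(X[val_idx]) != y[val_idx]))
    return float(np.mean(errors))
```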

Why does this work?

Training Time

Using a Test Dataset

Augmenting Training Set

The Training Error-Generalisation Gap Decomposition of E_new

This section discusses over-fitting and under-fitting

$$\bar{E}_\text{new}\overset{\Delta}{=}\mathbb{E}_{\mathcal{T}}[E_\text{new}(\mathcal{T})]\tag{4.8a}$$
$$\bar{E}_\text{train}\overset{\Delta}{=} \mathbb{E}_{\mathcal{T}}[E_\text{train}(\mathcal{T})]\tag{4.8b}$$
$$\bar{E}_\text{train} \lt \bar{E}_\text{new}\tag{4.9}$$
$$\text{generalisation gap}\overset{\Delta}{=}\bar{E}_\text{new}-\bar{E}_\text{train}\tag{4.10}$$
$$\bar{E}_\text{new}=\bar{E}_\text{train}+\text{generalisation gap}\tag{4.11}$$
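The averages in (4.8a)-(4.8b) can be approximated by simulation when data can be generated at will: repeatedly draw a training set, record the training error and an estimate of the new-data error, and average over the draws. A rough sketch, with a placeholder data-generating process and model (none of these choices come from the book):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def draw_dataset(n):
    """Placeholder data-generating process standing in for p(x, y)."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)   # noisy labels
    return X, y

X_test, y_test = draw_dataset(20_000)       # large sample used to approximate E_new
e_train, e_new = [], []
for _ in range(200):                        # expectation over training sets T, as in (4.8)
    X, y = draw_dataset(100)
    model = KNeighborsClassifier(n_neighbors=5).fit(X, y)
    e_train.append(np.mean(model.predict(X) != y))
    e_new.append(np.mean(model.predict(X_test) != y_test))

gap = np.mean(e_new) - np.mean(e_train)     # generalisation gap, eq. (4.10)
print(f"E_train(avg)={np.mean(e_train):.3f}  E_new(avg)={np.mean(e_new):.3f}  gap={gap:.3f}")
```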

Generalisation Gap Factors




Figure 1 - Over-fitting vs Under-fitting

Binary Classification Example

Figure 2 -   Optimal Decision Boundary for Classification Problem

Figure 3 -   Example of Over-fitting and Under-fitting in KNN Classification

| | $k$-NN with $k=70$ | $k$-NN with $k=20$ | $k$-NN with $k=2$ |
|---|---|---|---|
| $\bar{E}_\text{train}$ | 0.24 | 0.22 | 0.17 |
| $\bar{E}_\text{new}$ | 0.25 | 0.23 | 0.30 |
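A sketch of how a table like the one above can be produced: sweep k for a k-NN classifier and compare training error with hold-out error. The synthetic dataset here is a placeholder, so the numbers will not match the book's:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)   # noisy labels: some error is irreducible
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

for k in (70, 20, 2):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    e_train = np.mean(model.predict(X_tr) != y_tr)
    e_val = np.mean(model.predict(X_val) != y_val)
    print(f"k={k:3d}  E_train={e_train:.2f}  E_hold-out={e_val:.2f}")
# Small k: low training error but larger hold-out error (over-fitting).
# Large k: the two errors are close but both higher (under-fitting).
```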

Figure 4 - Behaviour of the training and new-data errors as the size of the training set increases. Simple model (left) vs complex model (right)

Reducing E_new in Practice



Shortcomings of Model Complexity Scale

Example - Training Error and Generalisation Gap for Regression

Figure 5  - Evaluation of training loss and generalisation gap

Bias-Variance Decomposition of E_new

Recap of Bias and Variance

$$\text{Bias}: \overline{z}-z_0\tag{4.12a}$$
$$\text{Variance}: \mathbb{E}[(z-\overline{z})^2]=\mathbb{E}[z^2]-\overline{z}^2\tag{4.12b}$$

$$\begin{align*} \mathbb{E}[(z-z_0)^2]&=\mathbb{E}[((z-\overline{z}) + (\overline{z}-z_0))^2]\\ &=\underbrace{\mathbb{E}[(z-\overline{z})^2]}_{\text{variance}} +2 \underbrace{(\mathbb{E}[z] - \overline{z})}_{0} (\overline{z}-z_0) + \underbrace{(\overline{z}-z_0)^2}_{\text{bias}^2} \end{align*}\tag{4.13}$$
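A quick numerical check of (4.13): draw samples of a random variable z, pick a fixed z0, and verify that the expected squared error equals variance plus squared bias (the distribution and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(loc=2.0, scale=0.5, size=1_000_000)   # random variable z with mean 2.0
z0 = 1.2                                             # fixed reference value

mse = np.mean((z - z0) ** 2)                         # E[(z - z0)^2]
decomposed = np.var(z) + (np.mean(z) - z0) ** 2      # variance + bias^2, eq. (4.13)
print(mse, decomposed)                               # agree up to Monte Carlo error
```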

Bias and Variance in a Machine Learning Context


$$y=f_0({\bf{x}})+\varepsilon, \text{ with } \mathbb{E}[\varepsilon]=0 \text{ and var}(\varepsilon)=\sigma^2\tag{4.14}$$
$$\overline{f}({\bf{x}})\overset{\Delta}{=}\mathbb{E}_\mathcal{T}[\hat{y}({\bf{x}};\mathcal{T})]\tag{4.15}$$

$$\overline{E}_\text{new}=\mathbb{E}_\mathcal{T}[\mathbb{E}_\star[(\hat{y}({\bf{x_\star}};\mathcal{T})-y_\star)^2]]\tag{4.16}$$
$$\overline{E}_\text{new}=\mathbb{E}_\star[\mathbb{E}_\mathcal{T}[(\hat{y}({\bf{x_\star}};\mathcal{T})-f_0({\bf{x_\star}})-\varepsilon)^2]]\tag{4.17}$$
$$\mathbb{E}_\mathcal{T}\big[(\underbrace{\hat{y}({\bf{x_\star}};\mathcal{T})}_{\text{``}z\text{''}} - \underbrace{f_0({\bf{x_\star}})}_{\text{``}z_0\text{''}}-\varepsilon)^2\big] =(\overline{f}({\bf{x_\star}}) - f_0({\bf{x_\star}}))^2 + \mathbb{E}_\mathcal{T}\big[(\hat{y}({\bf{x_\star}};\mathcal{T})-\overline{f}({\bf{x_\star}}))^2\big]+\sigma^2 \tag{4.18}$$

Here the expectation is also taken over the noise $\varepsilon$: the cross terms vanish because $\mathbb{E}[\varepsilon]=0$, and $\mathbb{E}[\varepsilon^2]=\sigma^2$ gives the last term.
$$\overline{E}_\text{new}= \underbrace{\mathbb{E}_\star[(\overline{f}({\bf{x_\star}})-f_0({\bf{x_\star}}))^2]}_{\text{bias}^2} + \underbrace{\mathbb{E}_\star[\mathbb{E}_\mathcal{T}[(\hat{y}({\bf{x_\star}};\mathcal{T})-\overline{f}({\bf{x_\star}}))^2]]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible error}} \tag{4.19}$$

Figure 6 - Model complexity vs error, in terms of the bias-variance decomposition of $E_\text{new}$. Low model complexity means high bias; more complex models adapt to noise in the training data, which results in higher variance. To achieve a small $E_\text{new}$ we therefore need to select a suitable level of model complexity. This is called the bias-variance tradeoff.

Factors Affecting Bias and Variance

Example: Bias-Variance Tradeoff for L2L^2 Regularised Linear Regression

$$x\sim\mathcal{U}[0,1]$$
$$y=5-2x+x^3+\varepsilon, \qquad \varepsilon\sim\mathcal{N}(0,1)\tag{4.20}$$
$$y=\beta_0+\beta_1x + \beta_2x^2+\beta_3x^3+\beta_4x^4+\varepsilon\tag{4.21}$$
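A sketch of the experiment in (4.20)-(4.21): repeatedly draw training sets from (4.20), fit the fourth-order polynomial (4.21) with L2 (ridge) regularisation, and estimate the squared-bias and variance terms of (4.19) on a test grid. The training-set size, number of repeats, regularisation strength, and the plain closed-form ridge solver are assumptions made for illustration:

```python
import numpy as np

def ridge_poly_fit(x, y, lam, degree=4):
    """Closed-form L2-regularised least squares for the polynomial model (4.21)."""
    Phi = np.vander(x, degree + 1, increasing=True)       # columns: 1, x, x^2, x^3, x^4
    A = Phi.T @ Phi + lam * np.eye(degree + 1)
    return np.linalg.solve(A, Phi.T @ y)

def predict(beta, x, degree=4):
    return np.vander(x, degree + 1, increasing=True) @ beta

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 100)
f0 = 5 - 2 * x_test + x_test ** 3                         # true function from (4.20)

lam, n, n_repeats = 1.0, 30, 500                          # arbitrary illustration settings
preds = np.empty((n_repeats, x_test.size))
for r in range(n_repeats):
    x = rng.uniform(0, 1, size=n)
    y = 5 - 2 * x + x ** 3 + rng.normal(size=n)           # noise variance sigma^2 = 1
    preds[r] = predict(ridge_poly_fit(x, y, lam), x_test)

f_bar = preds.mean(axis=0)                                # average model, eq. (4.15)
bias_sq = np.mean((f_bar - f0) ** 2)                      # squared-bias term of (4.19)
variance = np.mean(preds.var(axis=0))                     # variance term of (4.19)
print(f"bias^2={bias_sq:.3f}  variance={variance:.3f}  E_new ~= {bias_sq + variance + 1.0:.3f}")
```

Increasing `lam` pushes the bias term up and the variance term down; sweeping it reproduces the tradeoff shown in the figure below.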

Figure 8 - Effect of regularisation on error in polynomial regression model.

Connections between Bias, Variance and the Generalisation Gap


$$\sigma^2+\text{bias}^2 =\mathbb{E}_\star[(\overline{f}({\bf{x}}_\star)-y_\star)^2]\\ \approx\frac{1}{n}\sum_{i=1}^{n}(\overline{f}({\bf{x}}_i)-y_i)^2\\ \ge \frac{1}{n}\sum_{i=1}^{n}(\hat{y}({\bf{x}}_i;\mathcal{T})-y_i)^2=E_\text{train}\tag{4.22}$$
$$\text{generalisation gap}\overset{\gt}{\approx} \text{variance}\tag{4.23a}$$
$$E_\text{train}\overset{\lt}{\approx}\text{bias}^2+\sigma^2 \tag{4.23b}$$

Additional Tools for Evaluating Binary Classifiers

Confusion Matrix and ROC Curve

| | $y=-1$ | $y=1$ | total |
|---|---|---|---|
| $\hat{y}({\bf{x}})=-1$ | True Negative (TN) | False Negative (FN) | $N^*$ |
| $\hat{y}({\bf{x}})=1$ | False Positive (FP) | True Positive (TP) | $P^*$ |
| total | $N$ | $P$ | $n$ |

$P$ ($N$) denotes the total number of positive (negative) examples in the dataset; $P^*$ ($N^*$) denotes the total number of positive (negative) predictions made by the model.

$$\text{recall}=\frac{TP}{P}=\frac{TP}{TP+FN}$$
$$\text{precision}=\frac{TP}{P^*}=\frac{TP}{TP+FP}$$
| Ratio | Name |
|---|---|
| FP/N | Fall-out, probability of false alarm, false positive rate |
| TN/N | Specificity, selectivity, true negative rate |
| TP/P | Sensitivity, power, probability of detection, true positive rate, recall |
| FN/P | Miss rate, false negative rate |
| TP/P* | Positive predictive value, precision |
| FP/P* | False discovery rate |
| TN/N* | Negative predictive value |
| FN/N* | False omission rate |
| P/n | Prevalence |
| (FN + FP)/n | Misclassification rate |
| (TN + TP)/n | Accuracy (1 − misclassification rate) |
| 2TP/(P* + P) | $F_1$ score |
| $(1+\beta^2)TP\,/\,((1+\beta^2)TP+\beta^2 FN+FP)$ | $F_\beta$ score |

Recall: the fraction of positive data points that are correctly predicted as positive.
Precision: the fraction of true positives among all points predicted as positive.
ROC (Receiver Operating Characteristic): plotting the ROC curve (true positive rate against false positive rate as the decision threshold $r$ varies) is useful for comparing classifiers across different threshold values.
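A sketch of these quantities in code: count the confusion-matrix entries at a single threshold to get precision and recall, and trace out the ROC curve by sweeping the decision threshold r over a classifier's scores. The scores and labels are assumed to be NumPy arrays with labels in {-1, 1}, matching the table above:

```python
import numpy as np

def confusion_counts(y_hat, y):
    """TP, FP, TN, FN for labels in {-1, 1}."""
    tp = np.sum((y_hat == 1) & (y == 1))
    fp = np.sum((y_hat == 1) & (y == -1))
    tn = np.sum((y_hat == -1) & (y == -1))
    fn = np.sum((y_hat == -1) & (y == 1))
    return tp, fp, tn, fn

def precision_recall(y_hat, y):
    tp, fp, _, fn = confusion_counts(y_hat, y)
    return tp / (tp + fp), tp / (tp + fn)

def roc_points(scores, y, thresholds):
    """For each threshold r, classify score > r as positive and record (FPR, TPR)."""
    P, N = np.sum(y == 1), np.sum(y == -1)
    points = []
    for r in thresholds:
        tp, fp, _, _ = confusion_counts(np.where(scores > r, 1, -1), y)
        points.append((fp / N, tp / P))   # (false positive rate, true positive rate)
    return points
```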

Figure 9 - ROC Curve for Classifier



A problem with the $F_1$ score is that it weights precision and recall equally; the $F_\beta$ score generalises it by treating recall as $\beta$ times as important as precision:


$$F_\beta = \frac{(1+\beta^2)\cdot\text{precision}\cdot\text{recall}}{\beta^2\cdot\text{precision}+\text{recall}}$$

Figure 10 - Precision-Recall Curve for Binary Classification Problem. Precision-Recall curves are good for imbalanced classification problems.