Lindholm22 Chapter 2

A summary of Lindholm22 Chapter 2: it introduces the supervised machine learning problem as well as two basic machine learning methods for solving it:

  1. k-nearest neighbours
  2. decision trees

Supervised Machine Learning

Learning from Labelled Data

Each data point consists of an input vector $\bf{x}$ of $p$ input variables (features) and a corresponding output (label) $y$; the training data consists of $n$ such labelled input-output pairs.
\bf{x}=\begin{bmatrix}x_1&x_2&\cdots &x_p\end{bmatrix}^T \tag{1}

Numerical and Categorical Variables

The input variables can be either numerical (ordered numbers) or categorical (values from a finite, unordered set); both kinds are collected in the input vector
\bf{x}=\begin{bmatrix}x_1&x_2&\cdots&x_p\end{bmatrix}^T \tag{2}

Classification and Regression

Regression: the output $y$ is numerical.
Classification: the output $y$ is categorical, taking one of $M$ possible classes.

In both cases the input is the vector of $p$ input variables,
\bf{x}=\begin{bmatrix}x_1&x_2&\cdots&x_p\end{bmatrix}^T \tag{3}
Some examples of categorical output domains:

| Possible Output Values | $M$ |
|---|---|
| $\{\text{false, true}\}$ | 2 |
| $\{\text{Sweden, Norway, Finland, Denmark}\}$ | 4 |

Example - Classifying Songs


Figure 2.1 - Perceived Energy vs Length of a song

Note: With many machine learning tasks it is not reasonable to expect 100% accuracy. Looking at the scatter plot, these two features are not enough to fully separate the song classes; higher accuracy (greater separation between the classes) would require more features, or more informative ones.

Generalising Beyond Training Data

There are two reasons why it can be of interest to mathematically model the input-output relationship in the training data.

  1. To reason about and explore how the input and output variables are related, e.g. is there a correlation between a pair of variables?
  2. To predict the output value $y_\star$ for some new, previously unseen input $\bf{x}_\star$. Using a mathematical method that generalises the input-output examples seen in the training data, we can make a prediction $\hat{y}(\bf{x}_\star)$ for a previously unseen $\bf{x}_\star$. The hat symbol denotes that the prediction is an estimate of the output.

k-Nearest Neighbours

For the simplest case, $k=1$, compute the distance from the test input to every training data point,
||\bf{x}_i-\bf{x}_\star||\ \forall i\in\lbrace 1,\cdots,n\rbrace\tag{4}
then find the data point $\bf{x}_j$ in the training set with the shortest distance to $\bf{x}_\star$ and use its output as the prediction:
\hat{y}(\bf{x_\star})=y_j \tag{5}
For a general $k$, the prediction is instead based on the set of the $k$ nearest neighbours,
\mathcal{N}_\star=\lbrace i: \bf{x}_i \text{ is one of the }k\text{ training data points closest to } \bf{x}_\star\rbrace \tag{6}

Note: We don't care about the model's performance on the dataset used for training - we want to evaluate the model on data it hasn't seen before (testing the model's ability to generalise).


The $k$-Nearest Neighbours method can be summarised as follows:


Data: Training data, denoted $\{\bf{x}_i, y_i\}_{i=1}^{n}$, and a test input, denoted $\bf{x}_\star$

Result: Predicted test output, $\hat{y}(\bf{x}_\star)$

  1. Compute the distances $||\bf{x}_i-\bf{x}_\star||_2$ for all training points $i=1, \cdots, n$
  2. Let $\mathcal{N}_\star=\lbrace i:\bf{x}_i \text{ is one of the }k \text{ data points closest to }\bf{x}_\star\rbrace$
  3. Compute the prediction $\hat{y}(\bf{x}_\star)$ using the following formula:
\hat{y}(\bf{x_\star})=\begin{cases}\text{Average}\lbrace y_j:j\in\mathcal{N}_\star\rbrace&\text{(Regression problems)}\\ \text{MajorityVote} \lbrace y_j : j \in \mathcal{N}_\star\rbrace  &\text{(Classification problems)}\end{cases}\tag{7}
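
A minimal code sketch of this procedure using NumPy; `knn_predict` and its `k` and `task` parameters are names chosen here for illustration, not taken from the book.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_star, k=3, task="classification"):
    """Predict the output for x_star with k-NN, following steps 1-3 above."""
    # Step 1: Euclidean distances from the test input to every training point.
    dists = np.linalg.norm(np.asarray(X_train, dtype=float) - np.asarray(x_star, dtype=float), axis=1)
    # Step 2: indices of the k closest training points (the neighbourhood set N_star).
    neighbours = np.argsort(dists)[:k]
    # Step 3: average for regression, majority vote for classification.
    if task == "regression":
        return float(np.mean(np.asarray(y_train, dtype=float)[neighbours]))
    return Counter(np.asarray(y_train)[neighbours].tolist()).most_common(1)[0][0]
```

Ties in the majority vote are broken arbitrarily here; the summary above does not prescribe a tie-breaking rule.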

Example - Predicting Colours with $k$-NN

The training data consists of $n=6$ data points with two input variables and a categorical output (colour):

| $i$ | $x_1$ | $x_2$ | $y$ |
|---|---|---|---|
| 1 | -1 | 3 | Red |
| 2 | 2 | 1 | Blue |
| 3 | -2 | 2 | Red |
| 4 | -1 | 2 | Blue |
| 5 | -1 | 0 | Blue |
| 6 | 1 | 1 | Red |
We want to predict the output (colour) for the test input
\bf{x_\star}=\begin{bmatrix}1&2\end{bmatrix}^T \tag{8}
Sorting the training data points by their Euclidean distance to $\bf{x}_\star$ gives:
| $i$ | $\|\bf{x}_i-\bf{x}_\star\|_2$ | $y_i$ |
|---|---|---|
| 6 | $\sqrt{1}$ | Red |
| 2 | $\sqrt{2}$ | Blue |
| 4 | $\sqrt{4}$ | Blue |
| 1 | $\sqrt{5}$ | Red |
| 5 | $\sqrt{8}$ | Blue |
| 3 | $\sqrt{9}$ | Red |

With $k=1$ the nearest neighbour is $i=6$, so the prediction is Red; with $k=3$ the neighbours are $\{6, 2, 4\}$ and the majority vote gives Blue.
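
The distance column and the resulting predictions can be checked with a short script; the data is taken directly from the tables above.

```python
import numpy as np

# Training data from the first table: rows are (x_1, x_2), labels are the colours.
X = np.array([[-1, 3], [2, 1], [-2, 2], [-1, 2], [-1, 0], [1, 1]], dtype=float)
y = np.array(["Red", "Blue", "Red", "Blue", "Blue", "Red"])
x_star = np.array([1.0, 2.0])

dists = np.linalg.norm(X - x_star, axis=1)
for i in np.argsort(dists):
    print(f"i={i + 1}: distance={dists[i]:.3f}  y={y[i]}")
# k=1: the nearest neighbour is i=6, so the prediction is Red.
# k=3: the neighbours are {6, 2, 4}, so the majority vote gives Blue.
```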

Figure 2.3 - $k$-NN prediction for $k=1$

Figure 2.4 - Decision Boundary for $k$-NN when $k=1$ (left) and $k=3$ (right)

Choosing k


A small $k$ gives a flexible model that adapts closely to the training data and risks overfitting to noise, whereas a larger $k$ averages over more neighbours and gives smoother, less flexible predictions; $k$ is a hyperparameter that the user has to choose.

Figure 2.5 - $k$-NN applied to the music classification example (above) and car stopping example (below) with $k=1$ and $k=20$.

Input Normalisation

If the input variables have very different scales, the variable with the largest numerical range dominates the Euclidean distance (written here for $p=2$ input variables),
||\bf{x_i}-\bf{x_\star}||_2=\sqrt{(x_{i1}-x_{\star 1})^2 + (x_{i2}-x_{\star 2})^2}.
The inputs are therefore usually normalised first, for example with min-max scaling,
x_{ij}^{\text{new}}=\frac{x_{ij}-\min_{\ell}(x_{\ell j})}{\max_\ell (x_{\ell j})-\min_\ell(x_{\ell j})},\ \ \ \ \ \ \ \  \forall j\in\lbrace 1, \cdots, p\rbrace,\ \  i\in \lbrace 1, \cdots, n\rbrace
or by standardising each input variable with its mean $\bar{x}_j$ and standard deviation $\sigma_j$,
x_{ij}^{\text{new}} = \frac{x_{ij}-\bar{x}_j}{\sigma_j}, \ \ \ \ \ \ \ \ \forall j\in\lbrace 1, \cdots, p \rbrace, \ \ i \in \lbrace 1, \cdots, n \rbrace
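
Both normalisation schemes are a few lines with NumPy. A minimal sketch, assuming the training inputs are stored row-wise in an array `X`; the variable names are chosen here for illustration.

```python
import numpy as np

# Toy training inputs, one row per data point; the second feature has a much larger scale.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0]])

# Min-max scaling: each feature is mapped to [0, 1] (first formula above).
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardisation: zero mean and unit standard deviation per feature (second formula above).
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)
```

Note that a test input $\bf{x}_\star$ must be normalised with the same constants (min/max or mean/standard deviation) computed from the training data.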

Decision Trees


Example - Predicting Colours with a Decision Tree

Figure 3 - Predicting colours using a decision tree. (Left) Pre-defined decision tree. (Right) Prediction partition of input space
def predict(x_1, x_2):
    # Traverse the tree in Figure 3 from the root: first split on x_2, then on x_1.
    if x_2 < 3.0:
        return "Blue"
    else:
        if x_1 < 5.0:
            return "Blue"
        else:
            return "Red"
Figure 4 - Python code for predicting a test input with the tree in Figure 3.

Learning a Regression Tree

A regression tree partitions the input space into $L$ regions $R_1, \cdots, R_L$ and predicts with a constant $\hat{y}_\ell$ in each region,
\hat{y}(\bf{x_\star})=\sum_{\ell=1}^{L} \hat{y}_\ell \mathbb{I}\lbrace \bf{x}_\star \in R_\ell \rbrace
where $\mathbb{I}$ is the indicator function and each constant is the average of the training outputs falling in that region,
\hat{y}_\ell=\text{Average}\lbrace y_i : \bf{x}_i \in R_\ell\rbrace
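
The two equations above describe a piecewise-constant function. A minimal sketch, where the regions are represented as boolean predicates; the particular splits and names are made up for illustration.

```python
import numpy as np

# A hypothetical partition of a two-dimensional input space into L = 3 regions.
regions = [
    lambda x: x[1] < 3.0,                    # R_1
    lambda x: x[1] >= 3.0 and x[0] < 5.0,    # R_2
    lambda x: x[1] >= 3.0 and x[0] >= 5.0,   # R_3
]

def region_averages(X_train, y_train):
    """y_hat_l = Average{ y_i : x_i in R_l }, assuming every region contains training points."""
    return [float(np.mean([y for x, y in zip(X_train, y_train) if r(x)])) for r in regions]

def predict_piecewise(x_star, y_hat):
    """y_hat(x_star) = sum over l of y_hat_l * I{ x_star in R_l }."""
    return sum(y_l for r, y_l in zip(regions, y_hat) if r(x_star))
```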


Each split is defined by an input variable index $j$ and a cutpoint $s$, which divide the current region into two halves,
R_1(j,s)=\{\bf{x} | x_j < s\}\ \ \ \ \ \ \ \ \text{and}\ \ \ \ \ \ \ \ R_2(j,s)=\{\bf{x} | x_j \ge s\}\tag{2.4}
with corresponding constant predictions
\hat{y}_1(j,s)=\text{Average}\{y_i:\bf{x}_i\in R_1(j,s)\}\ \ \ \ \ \ \ \ \text{and}\ \ \ \ \ \ \ \ \hat{y}_2(j,s)=\text{Average}\{y_i : \bf{x}_i \in R_2(j,s)\}
The pair $(j, s)$ is chosen to minimise the squared-error loss
\sum_{i: \bf{x}_i\in R_1(j,s)} (y_i-\hat{y}_1(j,s))^2 + \sum_{i: \bf{x}_i\in R_2(j,s)} (y_i-\hat{y}_2(j,s))^2\tag{2.5}

Finding the Optimal Split
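
A minimal sketch of the exhaustive search implied by Eqs. (2.4) and (2.5): try every input variable $j$ and every candidate cutpoint $s$, and keep the pair with the smallest squared-error loss. The function and variable names are chosen here for illustration.

```python
import numpy as np

def best_regression_split(X, y):
    """Return the pair (j, s) minimising the squared-error criterion in Eq. (2.5)."""
    best_j, best_s, best_loss = None, None, np.inf
    for j in range(X.shape[1]):
        # Candidate cutpoints: midpoints between consecutive distinct values of x_j,
        # so that both R_1(j, s) and R_2(j, s) are non-empty.
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2:
            in_R1 = X[:, j] < s
            loss = np.sum((y[in_R1] - y[in_R1].mean()) ** 2) \
                 + np.sum((y[~in_R1] - y[~in_R1].mean()) ** 2)
            if loss < best_loss:
                best_j, best_s, best_loss = j, s, loss
    return best_j, best_s, best_loss
```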


Classification Trees

For classification trees, the prediction in each region is instead the majority class among the training points in that region,
\hat{y}_\ell = \text{MajorityVote}\{y_i :\bf{x}_i\in R_\ell\}
and each split is chosen by minimising the total impurity of the two resulting regions,
\arg\min_{j, s}\ n_1 Q_1 + n_2 Q_2 \tag{2.6}
Generalisation to Classification Case
In the classification case, the class proportions in region $R_\ell$ (containing $n_\ell$ training data points) are estimated as
\hat{\pi}_{\ell m}=\frac{1}{n_\ell} \sum_{i : \bf{x}_i\in R_\ell} \mathbb{I}\{y_i=m\}
and the impurity $Q_\ell$ in Eq. (2.6) can be measured by the misclassification rate (2.7a), the Gini index (2.7b) or the entropy (2.7c):
Q_\ell=1-\max_{m} \hat{\pi}_{\ell m} \tag{2.7a}
Q_\ell =\sum_{m=1}^M \hat{\pi}_{\ell m} (1-\hat{\pi}_{\ell m}) \tag{2.7b}
Q_\ell=-\sum_{m=1}^M \hat{\pi}_{\ell m} \ln \hat{\pi}_{\ell m} \tag{2.7c}
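
The impurity measures can be computed directly from the estimated class proportions; a minimal sketch with names chosen here for illustration.

```python
import numpy as np

def class_proportions(y_region):
    """Estimate pi_hat_{l,m}: the proportion of each class among the points in a region."""
    _, counts = np.unique(y_region, return_counts=True)
    return counts / counts.sum()

def misclassification_rate(pi):    # Eq. (2.7a)
    return 1.0 - float(np.max(pi))

def gini_index(pi):                # Eq. (2.7b)
    return float(np.sum(pi * (1.0 - pi)))

def entropy(pi):                   # Eq. (2.7c)
    # Classes absent from the region are not included, so there are no log(0) terms.
    return float(-np.sum(pi * np.log(pi)))
```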

Learning a Decision Tree

Goal: Learn a decision tree using recursive binary splitting.

Data: Training data $\mathcal{T}=\{\bf{x}_i, y_i\}_{i=1}^n$

Result: Decision tree with regions $R_1, \cdots, R_L$ and corresponding predictions $\hat{y}_1, \cdots, \hat{y}_L$

  1. Let $R$ denote the whole input space
  2. Compute the regions $(R_1, \cdots, R_L)=\text{Split}(R, \mathcal{T})$
  3. Compute the predictions $\hat{y}_\ell$ for $\ell\in\{1, \cdots, L\}$ as:
    \hat{y}_\ell=\begin{cases}
        \text{Average}\{y_i:\bf{x}_i \in R_\ell\} & \text{(Regression Problems)}\\
        \text{MajorityVote}\{y_i : \bf{x}_i\in R_\ell \} &\text{(Classification Problems)}\\
    \end{cases}
    
Function Split(R, 𝒯):
    if (stopping criterion fulfilled)
        return R
    else
        Go through all possible splits xⱼ < s for all input variables j = 1, ..., p
        Pick the pair (j, s) that minimises the loss (Eq. 2.5 for regression, Eq. 2.6 for classification)
        Split the region R into R₁ and R₂ according to Eq. 2.4
        Split the data 𝒯 into 𝒯₁ and 𝒯₂ accordingly
        return Split(R₁, 𝒯₁), Split(R₂, 𝒯₂)
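
Putting the pieces together, the recursion above can be sketched in Python for the regression case with the squared-error loss of Eq. (2.5). This is an illustrative sketch only: the `min_size` stopping criterion and all names are chosen here, not taken from the book.

```python
import numpy as np

def learn_tree(X, y, min_size=5):
    """Recursive binary splitting for a regression tree."""
    # Stopping criterion: too few data points to split further -> leaf with the average output.
    if len(y) < min_size:
        return {"prediction": float(np.mean(y))}
    # Go through all possible splits x_j < s and keep the pair (j, s) minimising Eq. (2.5).
    best_j, best_s, best_loss = None, None, np.inf
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2:
            in_R1 = X[:, j] < s
            loss = np.sum((y[in_R1] - y[in_R1].mean()) ** 2) \
                 + np.sum((y[~in_R1] - y[~in_R1].mean()) ** 2)
            if loss < best_loss:
                best_j, best_s, best_loss = j, s, loss
    if best_j is None:  # all inputs identical, so no split is possible -> leaf
        return {"prediction": float(np.mean(y))}
    # Split the region and the data according to Eq. (2.4), then recurse on each half.
    in_R1 = X[:, best_j] < best_s
    return {"j": best_j, "s": best_s,
            "left": learn_tree(X[in_R1], y[in_R1], min_size),
            "right": learn_tree(X[~in_R1], y[~in_R1], min_size)}

def predict_tree(tree, x_star):
    """Follow the splits from the root node until a leaf is reached."""
    while "prediction" not in tree:
        tree = tree["left"] if x_star[tree["j"]] < tree["s"] else tree["right"]
    return tree["prediction"]
```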

Predicting from a Decision Tree

Data: A learned decision tree with regions $R_1, \cdots, R_L$ and predictions $\hat{y}_1, \cdots, \hat{y}_L$, and a test input $\bf{x}_\star$

Result: Predicted test output $\hat{y}(\bf{x}_\star)$

  1. Find the region $R_\ell$ to which $\bf{x}_\star$ belongs.
  2. Return the prediction $\hat{y}(\bf{x}_\star)=\hat{y}_\ell$

Example - Learning a Classification Tree

Figure 2.8 - Example 2.6 data

Consider the candidate split $x_1 < 2.5$, which divides the input space into
R_1=\{\bf{x} | x_1<2.5\}\ \ \ \ \ \ \ \ \text{and}\ \ \ \ \ \ \ \ R_2=\{\bf{x} | x_1\ge 2.5\}
With $n_1=3$ training points in $R_1$ (2 Blue, 1 Red) and $n_2=7$ in $R_2$ (3 Blue, 4 Red), the entropy criterion (2.7c) gives
Q_1= -\hat{\pi}_{1B} \ln(\hat{\pi}_{1B})-\hat{\pi}_{1R} \ln(\hat{\pi}_{1R})=-\frac{2}{3}\ln(\frac{2}{3})-\frac{1}{3}\ln(\frac{1}{3})=0.64
Q_2= -\hat{\pi}_{2B} \ln(\hat{\pi}_{2B})-\hat{\pi}_{2R} \ln(\hat{\pi}_{2R})=-\frac{3}{7}\ln(\frac{3}{7})-\frac{4}{7}\ln(\frac{4}{7})=0.68
so the total cost of this split is
n_1 Q_1 + n_2 Q_2 = 3\cdot 0.64 + 7 \cdot 0.68 = 6.69
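
These numbers are easy to verify numerically:

```python
import numpy as np

def entropy(pi):
    return float(-np.sum(np.asarray(pi) * np.log(pi)))

Q1 = entropy([2/3, 1/3])   # region R_1: 2 Blue, 1 Red out of n_1 = 3
Q2 = entropy([3/7, 4/7])   # region R_2: 3 Blue, 4 Red out of n_2 = 7
print(round(Q1, 2), round(Q2, 2), round(3 * Q1 + 7 * Q2, 2))   # 0.64 0.68 6.69
```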