COMP4702 Chapter 3

A summary of Lindholm Chapter 3

Note: With linear regression we are guaranteed to obtain the optimal solution (it can be computed in a single step using the closed form), whereas decision trees are not guaranteed to find the optimal solution (they might reach it by luck, but there is no guarantee).

Linear Regression


y = f({\bf{x}}) + \varepsilon \tag{3.1}

Linear Regression Model

y=\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p + \varepsilon \tag{3.2}
\begin{align*} y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p + \varepsilon = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_p \end{bmatrix} \begin{bmatrix} 1\\x_1\\\vdots\\x_p \end{bmatrix} + \varepsilon = \theta^T {\bf{x}} + \varepsilon \end{align*} \tag{3.3}
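To make the vector form in (3.3) concrete, below is a minimal NumPy sketch (not from the book) that evaluates \theta^T {\bf{x}} for illustrative parameter values, with the constant 1 prepended to the raw input.

```python
import numpy as np

# Illustrative parameter values and raw input (p = 2); both are made up.
theta = np.array([1.0, 0.5, -2.0])   # [theta_0, theta_1, theta_2]
x_raw = np.array([3.0, 4.0])         # [x_1, x_2]

x = np.concatenate(([1.0], x_raw))   # prepend the constant 1, as in Eq 3.3
y_noise_free = theta @ x             # theta^T x; adding epsilon would give y
print(y_noise_free)                  # 1.0 + 0.5*3.0 - 2.0*4.0 = -5.5
```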

Note: In this context, {\bf{x}} is used to denote the input vector, both with and without the leading constant 1.
Additionally note that \varepsilon in (3.3) is the scalar noise / error term for a single data point (not a p-vector); in (3.11) the errors for all n training data points are collected into a single vector.


[1] Affine: a linear function plus a constant offset.

Figure 1 - Linear regression with p=1. Black dots represent data points and the blue line represents the learned regression model. The model does not fit the data perfectly, so the remaining error corresponding to the random noise ε for each data point is shown in red. The model can be used to *predict* (blue circle) the output ŷ(𝐱★) for a test input 𝐱★.

\hat{y}({\bf{x}}_\star)=\hat{\theta}_0 + \hat{\theta}_1 x_{\star 1} + \hat{\theta}_2 x_{\star 2} + \cdots + \hat{\theta}_p x_{\star p} = \hat{\theta}^{T} {\bf{x}}_\star \tag{3.4}

Training a Linear Regression Model

{\bf{X}}=\begin{bmatrix} {\bf{x}}_1^T \\ {\bf{x}}_2^T \\ \vdots \\ {\bf{x}}_n^T \end{bmatrix}, \qquad {\bf{y}}=\begin{bmatrix} y_1 & y_2 & \cdots & y_n \end{bmatrix}^T \tag{3.5}

Defining a Loss Function


\hat\theta = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n L(\hat{y}({\bf{x}}_i; \theta), y_i)\tag{3.9}

Least Squares and the Normal Equations

L(\hat{y}({\bf{x}}; \theta), y)=(\hat{y}({\bf{x}};\theta)-y)^2 \tag{3.10}
\begin{align*}J(\theta)&=\frac{1}{n} \sum_{i=1}^{n} (\hat{y}({\bf{x}}_i ; \theta) - y_i)^2 \\ &= \frac{1}{n} ||\hat{\bf{y}} - {\bf{y}} ||_2^2 \\ &= \frac{1}{n} ||{\bf{X}}\theta-{\bf{y}}||_2^2\\ &= \frac{1}{n} ||\boldsymbol{\varepsilon}||_2^2\end{align*}\tag{3.11}

\begin{align*}\hat{\theta} &= \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^n (\theta^T {\bf{x}}_i - y_i)^2 \\ &=\arg\min_\theta \frac{1}{n} || {\bf{X}}\theta-{\bf{y}}||_2^2 \tag{3.12}\end{align*}
{\bf{X}}^T {\bf{X}}\hat{\theta} = {\bf{X}}^T {\bf{y}}\tag{3.13}
Figure 2 - Graphical Representation of Squared Error Loss Function.
\hat{\theta} = ({\bf{X}}^T{\bf{X}})^{-1} {\bf{X}}^T {\bf{y}} \tag{3.14}

Goal Learn linear regression using squared error loss.

Data Training data \mathcal{T} = \lbrace {\bf{x}}_i , y_i \rbrace_{i=1}^{n}. Result Learned parameter vector \hat{\theta}

  1. Construct the matrix {\bf{X}} and vector {\bf{y}} according to Eq 3.5
  2. Compute \hat{\theta} by solving Eq 3.13

Goal Predict with linear regression. Data Learned parameter vector \hat{\theta} and test input {\bf{x}}_\star. Result Prediction \hat{y}({\bf{x}}_\star)

  1. Compute \hat{y}({\bf{x}}_\star) = \hat{\theta}^T {\bf{x}}_\star (see the sketch below)
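The two boxes above translate almost directly into NumPy. The sketch below is an illustration using assumed synthetic data (the variable names and data-generating model are ours, not the book's): it builds X and y as in Eq 3.5, solves the normal equations (Eq 3.13), and predicts for a test input via Eq 3.4.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data (n = 50, p = 2): y = 1 + 2*x1 - 3*x2 + noise.
n, p = 50, 2
X_raw = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X_raw[:, 0] - 3.0 * X_raw[:, 1] + 0.1 * rng.normal(size=n)

# Learn, step 1: construct X and y as in Eq 3.5 (prepend a column of 1s).
X = np.column_stack([np.ones(n), X_raw])

# Learn, step 2: compute theta_hat by solving the normal equations (Eq 3.13).
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Predict: Eq 3.4, with the leading 1 already included in x_star.
x_star = np.array([1.0, 0.5, -0.5])
y_hat = theta_hat @ x_star
print(theta_hat, y_hat)
```

In practice `np.linalg.lstsq(X, y, rcond=None)` is usually preferred over forming XᵀX explicitly, as it is more robust when XᵀX is close to singular.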

Maximum Likelihood Perspective (Derivation of Least-Squares)

\hat{\theta} = \arg \max_\theta p({\bf{y}} | {\bf{X}} ; \theta)\tag{3.15}
\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2) \tag{3.16}

Stochastic Having a random probability distribution or pattern that may be analysed statistically but may not be predicted precisely.

p({\bf{y}} | {\bf{X}};\theta)=\prod_{i=1}^n p(y_i | {\bf{x}}_i, \theta)\tag{3.17}
\begin{align*} p(y_i | {\bf{x}}_i, \theta)&=\mathcal{N}(y_i;\theta^T {\bf{x}}_i, \sigma_\varepsilon^2)\\ &=\frac{1}{\sqrt{2\pi\sigma_\varepsilon^2}} \exp\left(-\frac{1}{2\sigma_\varepsilon^2} (\theta^T {\bf{x}}_i - y_i)^2\right) \end{align*}\tag{3.18}
\ln p ({\bf{y}} | {\bf{X}} ; \theta) = \sum_{i=1}^n \ln p (y_i | {\bf{x}}_i, \theta) \tag{3.19}
\ln p({\bf{y}} | {\bf{X}} ; \theta) = -\frac{n}{2}\ln(2\pi\sigma_\varepsilon^2) - \frac{1}{2\sigma_\varepsilon^2}\sum_{i=1}^n (\theta^T {\bf{x}}_i-y_i)^2 \tag{3.20}
\color{gray}\hat{\theta} = \arg \max_\theta p({\bf{y}} | {\bf{X}} ; \theta)\tag{3.15}
\begin{align*}
\hat{\theta} 
&=\arg\max_\theta -\sum_{i=1}^n(\theta^T {\bf{x}}_i-y_i)^2\\
&=\arg\min_\theta\frac{1}{n} \sum_{i=1}^n (\theta^T {\bf{x}}_i - y_i)^2\tag{3.21}
\end{align*}
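As a quick numerical sanity check of (3.21), the sketch below (ours, with synthetic data and an assumed noise level) minimises the negative Gaussian log-likelihood of (3.20) with a generic optimiser and compares the result with the closed-form least-squares solution; the two estimates should coincide up to solver tolerance.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Synthetic data with Gaussian noise, matching the assumption in Eq 3.16.
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, -1.5]) + 0.3 * rng.normal(size=n)

def neg_log_lik(theta, sigma2=0.3**2):
    # Negative of Eq 3.20: constant term plus the scaled sum of squared errors.
    r = X @ theta - y
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * np.sum(r**2) / sigma2

theta_ml = minimize(neg_log_lik, x0=np.zeros(2)).x   # maximum likelihood
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)     # least squares
print(theta_ml, theta_ls)                            # the two estimates agree
```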

Categorical Input Variables

x=\begin{cases}
0&\text{if A}\\
1&\text{if B}\\
\end{cases}\tag{3.22}
y=\theta_0 + \theta_1x + \varepsilon=
\begin{cases}
\theta_0+\varepsilon&\text{if A}\\
\theta_0 + \theta_1 + \varepsilon&\text{if B}\\
\end{cases}\tag{3.23}
{\bf{x}} 
  = \begin{bmatrix}
    x_A & x_B & x_C & x_D
    \end{bmatrix}^T
\tag{3.24}
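A minimal sketch (ours) of the two encodings above: the binary dummy variable of Eq 3.22 and the one-hot vector of Eq 3.24, using illustrative category labels A–D.

```python
import numpy as np

# Binary category (Eq 3.22): A -> 0, B -> 1.
binary = np.array(["A", "B", "B", "A"])
x_binary = (binary == "B").astype(float)               # [0., 1., 1., 0.]

# Four categories (Eq 3.24): one-hot vector [x_A, x_B, x_C, x_D] per data point.
categories = np.array(["A", "C", "D", "B"])
levels = np.array(["A", "B", "C", "D"])
X_onehot = (categories[:, None] == levels[None, :]).astype(float)

print(x_binary)
print(X_onehot)   # each row has a single 1 in the column of its category
```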

Classification and Logistic Regression

Statistical View of Classification

p ( y = m | {\bf{x}})\tag{3.25}

Example - Voting Behaviour using Probabilities

\begin{align*}
  p(y=\text{cerise party} | {\bf{x}}=\text{46 year old woman}) = 0.13\\
  p(y=\text{turquoise party} | {\bf{x}}=\text{46 year old woman}) = 0.39\\
  p(y=\text{purple party} | {\bf{x}}=\text{46 year old woman}) = 0.48
\end{align*}

p(y=1 | {\bf{x}}) \text{ is modelled by } g({\bf{x}}) \tag{3.26a}
p(y=-1|{\bf{x}})\text{ is modelled by } 1-g({\bf{x}}) \tag{3.26b}
\begin{bmatrix} p(y=1|{\bf{x}})\\ p(y=2|{\bf{x}})\\ \vdots\\ p(y=M|{\bf{x}})\\ \end{bmatrix} \text{ is modelled by } \begin{bmatrix} g_1({\bf{x}})\\ g_2({\bf{x}})\\ \vdots\\ g_M({\bf{x}}) \end{bmatrix} ={\bf{g}}({\bf{x}})\tag{3.27}

Logistic Regression Model for Binary Classification

z = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p =\theta^T {\bf{x}}\tag{3.28}
Figure 3 - Plot of the logistic function, given by h(z) = e^z / (1 + e^z)

g({\bf{x}}) = \frac{e^{\theta^T {\bf{x}}}}{1 + e^{\theta^T {\bf{x}}}}\tag{3.29a}
p(y=-1 | {\bf{x}}) = 1 - g({\bf{x}}) = 1 - \frac{e^{\theta^T {\bf{x}}}}{1 + e^{\theta^T {\bf{x}}}} = \frac{1}{1+e^{\theta^T {\bf{x}}}} = \frac{e^{-\theta^T {\bf{x}}}}{1 + e^{-\theta^T {\bf{x}}}} \tag{3.29b}
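A minimal sketch (ours, with illustrative parameter values) of the class-probability model in Eq 3.29: the logistic function applied to \theta^T {\bf{x}} gives p(y=1|{\bf{x}}), and its complement gives p(y=-1|{\bf{x}}).

```python
import numpy as np

def g(x, theta):
    # Eq 3.29a: p(y = 1 | x) modelled by the logistic function of theta^T x.
    z = theta @ x
    return np.exp(z) / (1.0 + np.exp(z))

theta = np.array([0.5, -1.0, 2.0])   # illustrative parameter values
x = np.array([1.0, 0.2, 0.4])        # input with the leading 1 included
p_pos = g(x, theta)                  # p(y = 1 | x)
p_neg = 1.0 - p_pos                  # p(y = -1 | x), Eq 3.29b
print(p_pos, p_neg)
```

For large |θᵀx| the direct exponential can overflow; a library implementation such as `scipy.special.expit` handles this more carefully.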

Training Logistic Regression Model by Maximum Likelihood

\hat{\theta}=\arg\max_\theta p({\bf{y}} | {\bf{X}} ; \theta) = \arg\max_\theta \sum_{i=1}^n \ln p(y_i | {\bf{x}}_i; \theta)\tag{3.30}
\ln p(y_i|{\bf{x}}_i;\theta)= \begin{cases} \ln g({\bf{x}}_i;\theta)&\text{if }y_i=1\\ \ln(1-g({\bf{x}}_i;\theta))&\text{if }y_i=-1 \end{cases}\tag{3.31}

\text{If } y_i=1{:}\quad g({\bf{x}}_i;\theta) =\frac {e^{\theta^T{\bf{x}}_i}} {1+e^{\theta^T{\bf{x}}_i}} =\frac {e^{y_i \theta^T {\bf{x}}_i}} {1+e^{y_i \theta^T {\bf{x}}_i}} \tag{3.33a}
\text{If } y_i=-1{:}\quad 1-g({\bf{x}}_i;\theta) =\frac {1} {1+e^{\theta^T{\bf{x}}_i}} =\frac {e^{y_i \theta^T {\bf{x}}_i}} {1+e^{y_i \theta^T {\bf{x}}_i}} \tag{3.33b}

In both cases p(y_i | {\bf{x}}_i; \theta) = \frac{e^{y_i \theta^T {\bf{x}}_i}}{1+e^{y_i \theta^T {\bf{x}}_i}}, which gives the cost function in (3.34).
\begin{align*} J(\theta)&=\frac{1}{n} \sum_{i=1}^{n}-\ln\frac{e^{y_i \theta^T {\bf{x}}_i}}{1+e^{y_i \theta^T {\bf{x}}_i}} =\frac{1}{n}\sum_{i=1}^{n}-\ln \frac{1}{1 + e^{-y_i \theta^T {\bf{x}}_i}}\\ &=\frac{1}{n}\sum_{i=1}^{n} \underbrace{\ln (1 + e^{-y_i \theta^T {\bf{x}}_i})} _{\text{Logistic loss } \mathcal{L}({\bf{x}}_i, y_i, \theta)} \end{align*} \tag{3.34}

\hat\theta=\arg\min_\theta\frac{1}{n}\sum_{i=1}^{n}\ln(1+e^{-y_i \theta^T {\bf{x}}_i})\tag{3.35}

Logistic Regression Predictions and Decision Boundaries

Thus far we have developed a method for predicting the probability of each class for some test input {\bf{x}}_\star. To turn these probabilities into a hard class prediction, we threshold g({\bf{x}}_\star) at some decision threshold r (most commonly r = 0.5):

\hat{y}({\bf{x}}_\star)= \begin{cases} 1 & \text{if } g({\bf{x}}_\star)>r\\ -1 & \text{if }g({\bf{x}}_\star)\le r \end{cases}\tag{3.36}

Logistic Regression Pseudocode

Learn binary logistic regression

Data: Training data \mathcal{T}=\lbrace{\bf{x}}_i, y_i\rbrace_{i=1}^{n} with output classes y \in \lbrace-1,1\rbrace
Result: Learned parameter vector \hat\theta

  1. Compute \hat{\theta} by solving (Eq 3.35) numerically

Predict with binary logistic regression

Data Learned parameter vector \hat\theta and test input {\bf{x}}_\star
Result Prediction \hat{y}({\bf{x}}_\star)

  1. Compute g({\bf{x}}_\star) using (Eq. 3.29)
  2. If g({\bf{x}}_\star)>0.5 then return \hat{y}({\bf{x}}_\star)=1, else return \hat{y}({\bf{x}}_\star)=-1 (see the sketch below)
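Putting the two boxes together, here is a minimal sketch (ours, with synthetic data) that solves Eq 3.35 numerically with a general-purpose optimiser and then predicts using the 0.5 threshold of step 2; the data-generating parameters and the use of `scipy.optimize.minimize` are our choices, not prescribed by the text.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Synthetic binary data with labels in {-1, +1} and a leading column of 1s.
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
theta_true = np.array([0.5, 2.0, -2.0])
p_pos = 1.0 / (1.0 + np.exp(-X @ theta_true))
y = np.where(rng.uniform(size=n) < p_pos, 1.0, -1.0)

def cost(theta):
    # Eq 3.35: average logistic loss. np.logaddexp(0, t) computes ln(1 + e^t)
    # without overflowing for large t.
    return np.mean(np.logaddexp(0.0, -y * (X @ theta)))

# Learn: solve Eq 3.35 numerically.
theta_hat = minimize(cost, x0=np.zeros(X.shape[1])).x

# Predict: Eq 3.29 followed by thresholding at 0.5 (Eq 3.36 with r = 0.5).
x_star = np.array([1.0, 0.5, -0.5])
g_star = 1.0 / (1.0 + np.exp(-theta_hat @ x_star))
y_hat = 1 if g_star > 0.5 else -1
print(theta_hat, g_star, y_hat)
```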

Logistic Regression with Multiple Classes


\text{softmax}({\bf{z}})\overset{\Delta}{=} \frac{1}{\sum_{m=1}^{M} e^{z_m}} \begin{bmatrix} e^{z_1} & e^{z_2} & \cdots & e^{z_M} \end{bmatrix}^T \tag{3.41}
{\bf{g}}({\bf{x}})=\text{softmax}({\bf{z}}), \qquad \text{where }{\bf{z}}= \begin{bmatrix} \theta_1^T{\bf{x}}\\ \theta_2^T{\bf{x}}\\ \vdots\\ \theta_M^T{\bf{x}}\\ \end{bmatrix} \tag{3.42}

g_m({\bf{x}})=\frac{e^{\theta^T_m {\bf{x}}}}{\sum_{j=1}^{M} e^{\theta^T_j {\bf{x}}}}, \qquad m=1, \ldots, M \tag{3.43}
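A minimal sketch (ours) of Eq 3.41–3.43 for M = 3 classes; the parameter values are illustrative, and subtracting max(z) before exponentiating is a standard numerical safeguard rather than part of the definition.

```python
import numpy as np

def softmax(z):
    # Eq 3.41. Subtracting max(z) leaves the result unchanged but prevents
    # overflow in exp for large entries (an implementation detail).
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

# Eq 3.42/3.43 for M = 3: one parameter vector theta_m per class, stacked as rows.
Theta = np.array([[ 0.1,  1.0, -1.0],   # theta_1^T (illustrative values)
                  [ 0.0, -0.5,  0.5],   # theta_2^T
                  [-0.2,  0.3,  0.8]])  # theta_3^T
x = np.array([1.0, 0.4, -0.6])          # input with the leading 1 included
g = softmax(Theta @ x)                  # g(x): class probabilities
print(g, g.sum())                       # the probabilities sum to 1
```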

Training Logistic Regression with Multiple Classes

J(\theta)=\frac{1}{n}\sum_{i=1}^{n} \underbrace{-\ln g_{y_i} ({\bf{x}}_i;\theta)}_{\text{Multi-class cross-entropy loss}}\tag{3.44}
J(\theta)=\frac{1}{n}\sum_{i=1}^{n}\left(-\theta_{y_i}^T {\bf{x}}_i + \ln \sum_{j=1}^{M} e^{\theta_j^T{\bf{x}}_i}\right)\tag{3.45}
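A minimal sketch (ours) that evaluates the cost in Eq 3.45 directly; training would then minimise this numerically over the θ_m, just as in the binary case. The class labels are 0-indexed here (the text uses 1, …, M), and `scipy.special.logsumexp` computes the \ln \sum_j e^{\theta_j^T {\bf{x}}_i} term stably.

```python
import numpy as np
from scipy.special import logsumexp

def cross_entropy_cost(Theta, X, y):
    # Eq 3.45: J = (1/n) * sum_i ( -theta_{y_i}^T x_i + ln sum_j e^{theta_j^T x_i} ).
    # Theta has one row per class, X one row per data point, y holds class indices.
    Z = X @ Theta.T                               # n x M matrix of theta_j^T x_i
    return np.mean(-Z[np.arange(len(y)), y] + logsumexp(Z, axis=1))

# Tiny illustrative example: M = 3 classes, inputs with a leading 1.
X = np.array([[1.0,  0.5, -1.0],
              [1.0, -0.2,  0.3]])
y = np.array([0, 2])
Theta = np.zeros((3, 3))                # all-zero parameters
print(cross_entropy_cost(Theta, X, y))  # = ln(3) when all classes are equally likely
```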

Polynomial Regression and Regularisation



\begin{align*} X=\begin{bmatrix} 1 & 4.0 & 16.0 \\ 1 & 4.9 & 24.0 \\ 1 & 5.0 & 25.0 \\ \vdots & \vdots & \vdots \\ 1 & 39.6 & 1568.2 \\ 1 & 39.7 & 1576.1 \\ \end{bmatrix}, \qquad \theta = \begin{bmatrix}\theta_0\\\theta_1\\\theta_2\end{bmatrix}, \qquad y = \begin{bmatrix} 4.0 \\ 8.0 \\ 8.0 \\ \vdots \\ 134.0 \\ 110.0 \end{bmatrix} \end{align*} \tag{3.47}

Figure 4

Learning the car stopping distance with linear regression, second-order polynomial regression and 10th-order polynomial regression. From this we can see that the 10th-order polynomial is overfitting to outliers in the data, making it less useful than even ordinary linear regression (blue).
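Second-order polynomial regression is still linear in the parameters, so it reuses the machinery from earlier in the chapter: build a design matrix with columns [1, x, x²] and solve the least-squares problem. The sketch below (ours) uses only the handful of data points visible in Eq 3.47, so the fitted numbers are purely illustrative.

```python
import numpy as np

# A few of the car speeds / stopping distances shown in Eq 3.47 (not the full dataset).
speed = np.array([4.0, 4.9, 5.0, 39.6, 39.7])
distance = np.array([4.0, 8.0, 8.0, 134.0, 110.0])

# Design matrix with columns [1, x, x^2]; Eq 3.13/3.14 then apply unchanged.
X = np.column_stack([np.ones_like(speed), speed, speed**2])
theta_hat, *_ = np.linalg.lstsq(X, distance, rcond=None)

print(X[:3])        # matches the first rows of Eq 3.47 (up to rounding)
print(theta_hat)    # [theta_0, theta_1, theta_2]
```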

Regularisation


L2 Regularisation

\hat\theta=\arg\min_\theta\frac{1}{n} || {\bf{X}}\theta-{\bf{y}}||_2^2 + \lambda ||\theta||_2^2 \tag{3.48}
({\bf{X}}^T {\bf{X}} + n\lambda {\bf{I}}_{p+1})\hat\theta={\bf{X}}^T {\bf{y}}\tag{3.49}
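A minimal sketch (ours, with synthetic data and an arbitrary λ) of L2-regularised least squares via Eq 3.49: form XᵀX + nλI and solve the resulting linear system. As written, Eq 3.48 penalises every component of θ, including the offset θ_0, and the sketch follows that convention.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data; lam is an illustrative regularisation strength.
n, p = 30, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ rng.normal(size=p + 1) + 0.1 * rng.normal(size=n)
lam = 0.1

# Eq 3.49: (X^T X + n*lambda*I_{p+1}) theta_hat = X^T y.
A = X.T @ X + n * lam * np.eye(p + 1)
theta_hat = np.linalg.solve(A, X.T @ y)
print(theta_hat)
```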

\hat\theta=\arg\min_\theta\frac{1}{n}\sum_{i=1}^{n} \ln (1 + \exp(-y_i \theta^T {\bf{x}}_i)) + \lambda || \theta||_2^2\tag{3.50}

Generalised Linear Models

This section has been omitted as it is not part of the assessable content for this course