COMP4702 Lecture 3

This lecture discusses Linear Regression and Logistic Regression.

Linear Regression

Figure 1 - Car Stopping Distance with Linear Regression Model

y=\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p + \varepsilon \tag{3.2}

Training a Linear Regression Model


y=\theta^T {\bf{x}} + \varepsilon \tag{3.7}
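As a concrete illustration (a sketch added here, not from the lecture), the offset θ₀ is usually folded into θ by prepending a constant 1 to each input vector, so the prediction is a single dot product. The parameter values below are made up and assume p = 2 input variables.

```python
import numpy as np

# Hypothetical parameter vector [theta_0, theta_1, theta_2]; the values are made up.
theta = np.array([1.0, 0.5, -2.0])

# One input with p = 2 features; prepend a 1 so theta_0 acts as the offset term.
x = np.concatenate(([1.0], [3.0, 4.0]))

# Noise-free prediction y_hat = theta^T x; the epsilon term in (3.7) models what this misses.
y_hat = theta @ x
print(y_hat)  # 1.0 + 0.5*3.0 - 2.0*4.0 = -5.5
```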

Optimisation of Linear Regression

\hat{\theta}=\arg \min_\theta \underbrace{\frac{1}{n} \sum_{i=1}^{n} \overbrace{\mathcal{L}\big(\hat{y}({\bf{x}}_i;\theta), y_i\big)}^{\text{loss function}}}_{\text{cost function } J(\theta)}

Least-Squares and the Normal Equations

\begin{aligned}J(\theta)&=\frac{1}{n}\sum_{i=1}^{n} \big(\hat{y}({\bf{x}}_i;\theta)-y_i\big)^2\\&=\frac{1}{n} \|\hat{{\bf{y}}}-{\bf{y}}\|_2^2 \\&= \frac{1}{n} \|{\bf{X}}\theta-{\bf{y}}\|_2^2 \\&=\frac{1}{n}\|\boldsymbol{\epsilon}\|_2^2\end{aligned}\tag{3.11}

Figure 2 - Squared Error Loss Visualisation.

\hat{\theta}=\arg\min_\theta\frac{1}{n}\sum_{i=1}^{n}(\theta^T{\bf{x}}_i-y_i)^2=\arg\min_\theta\frac{1}{n}\|{\bf{X}}\theta-{\bf{y}}\|_2^2\tag{3.12}
\hat{\theta} = ({\bf{X}}^T{\bf{X}})^{-1} {\bf{X}}^T {\bf{y}} \tag{3.14}
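A minimal NumPy sketch of the normal equations (3.14), using a small synthetic dataset (the data, sizes and seed are illustrative assumptions, not from the lecture). Solving the linear system is preferred over forming the inverse explicitly; `np.linalg.lstsq` gives the same answer via a more numerically stable route.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2

# Synthetic data: X gets a leading column of ones so theta_0 is the offset.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + 0.1 * rng.normal(size=n)

# Normal equations (3.14): solve (X^T X) theta = X^T y instead of inverting X^T X.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same estimate via least squares directly.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_hat)    # close to [1, 2, -3]
print(theta_lstsq)
```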

Maximum Likelihood Perspective

This turns out to be equivalent to the sum-of-squared-errors perspective above.



\hat{\theta} = \arg \max_\theta p({\bf{y}} | {\bf{X}} ; \theta)\tag{3.15}

 

\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2) \tag{3.16}
p({\bf{y}} | {\bf{X}};\theta)=\prod_{i=1}^n p(y_i | {\bf{x}}_i, \theta)\tag{3.17}

 

\begin{aligned} p(y_i | {\bf{x}}_i, \theta)&=\mathcal{N}(y_i;\theta^T {\bf{x}}_i, \sigma_\varepsilon^2)\\ &=\frac{1}{\sqrt{2\pi\sigma_\varepsilon^2}} \exp\Big(-\frac{1}{2\sigma_\varepsilon^2} (\theta^T {\bf{x}}_i - y_i)^2\Big) \end{aligned}\tag{3.18}

Since the logarithm is monotone, maximising the likelihood is the same as maximising the log-likelihood; taking logs turns the product in (3.17) into a sum, and the Gaussian normalising constant does not depend on θ, so:

\begin{aligned} \hat{\theta} &=\arg\max_\theta p({\bf{y}}|{\bf{X}};\theta)\\ &=\arg\max_\theta -\sum_{i=1}^n(\theta^T {\bf{x}}_i-y_i)^2\\ &=\arg\min_\theta\frac{1}{n} \sum_{i=1}^n (\theta^T {\bf{x}}_i - y_i)^2 \end{aligned}\tag{3.21}

Observe that this is exactly the sum of squared error term we obtained earlier.
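This equivalence can also be checked numerically: with the noise variance fixed, the negative log-likelihood implied by (3.17)–(3.18) is an affine, increasing function of the sum of squared errors, so the two criteria share the same minimiser. The sketch below uses made-up synthetic data purely to verify that relationship.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + 0.2 * rng.normal(size=n)
sigma2 = 0.2 ** 2  # assumed (known) noise variance

def sse(theta):
    r = X @ theta - y
    return r @ r

def neg_log_likelihood(theta):
    # -ln p(y | X; theta) under the Gaussian noise model (3.16)-(3.18).
    return 0.5 * n * np.log(2 * np.pi * sigma2) + sse(theta) / (2 * sigma2)

# For any two candidate parameter vectors, the NLL difference is just the SSE
# difference scaled by 1/(2 sigma^2), so the ranking (and the argmin) is unchanged.
theta_a, theta_b = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(neg_log_likelihood(theta_a) - neg_log_likelihood(theta_b),
                 (sse(theta_a) - sse(theta_b)) / (2 * sigma2)))  # True
```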

Linear Classification / Logistic Regression

How do we deal with categorical input variables in the linear regression model?

Key Points


x=\begin{cases} 0&\text{if A},\\ 1&\text{if B} \end{cases}\tag{3.22}
{\bf{x}} = \begin{bmatrix} x_A & x_B & x_C & x_D \end{bmatrix}^T \tag{3.24}
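A minimal sketch of the dummy-variable / one-hot encoding in (3.22) and (3.24), assuming a hypothetical categorical feature with the four levels A–D:

```python
import numpy as np

levels = ["A", "B", "C", "D"]
observations = ["B", "D", "A"]  # hypothetical raw categorical values

# One row per observation, one column per level; exactly one entry per row is 1.
one_hot = np.array([[1.0 if obs == level else 0.0 for level in levels]
                    for obs in observations])
print(one_hot)
# [[0. 1. 0. 0.]
#  [0. 0. 0. 1.]
#  [1. 0. 0. 0.]]
```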

Binary Classification Problem

Figure 3 - Decision Boundary of 2-class problem with two input variables.

Squashing Linear Regression

Figure 4 - Logistic Function

Classification Notation

p(y=1 | {\bf{x}}) \text{ is modelled by } g({\bf{x}}) \tag{3.26a}

p(y=-1|{\bf{x}})\text{ is modelled by } 1-g({\bf{x}}) \tag{3.26b}

\begin{bmatrix} p(y=1|{\bf{x}})\\ p(y=2|{\bf{x}})\\ \vdots\\ p(y=M|{\bf{x}}) \end{bmatrix} \text{ is modelled by } \begin{bmatrix} g_1({\bf{x}})\\ g_2({\bf{x}})\\ \vdots\\ g_M({\bf{x}}) \end{bmatrix} = {\bf{g}}({\bf{x}}) \tag{3.27}

Model Notation

z = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p = \theta^T {\bf{x}} \tag{3.28}
g({\bf{x}}) = \frac{e^{z}}{1 + e^{z}} \tag{3.29a'}

* A modified version of Eq. 3.29a from Lindholm et al.
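A minimal sketch of (3.28)–(3.29a'), assuming an already-trained parameter vector (the numbers are made up): the linear score z is squashed into (0, 1) and read as p(y = 1 | x).

```python
import numpy as np

def logistic(z):
    # Algebraically identical to e^z / (1 + e^z) in (3.29a').
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 2.0, 0.5])  # hypothetical learned parameters, offset included
x = np.array([1.0, 0.3, -1.2])      # input with a leading 1 for theta_0

z = theta @ x                        # eq. (3.28)
g = logistic(z)                      # eq. (3.29a'): models p(y = 1 | x)
y_pred = 1 if g >= 0.5 else -1       # threshold at 0.5 for a hard class decision
print(z, g, y_pred)                  # -1.0, ~0.269, -1
```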

Training the Logistic Regression Model

In short, the logistic regression model is also trained using the principle of maximum likelihood.
The only difference from linear regression is that the class probability is now modelled by g(x), so the negative log-likelihood takes the cross-entropy form below rather than the squared-error form.

\begin{aligned} J(\theta) &=-\frac{1}{n}\sum_{i=1}^{n}\ln p(y_i|{\bf{x}}_i;\theta)\\ &=\frac{1}{n}\sum_{i=1}^{n} \underbrace{ \begin{cases} -\ln g({\bf{x}}_i;\theta)&\text{if }y_i=1\\ -\ln \big(1-g({\bf{x}}_i;\theta)\big)&\text{if }y_i=-1 \end{cases}}_{\text{Binary cross-entropy loss }\mathcal{L}(g({\bf{x}}_i;\theta),y_i)} \end{aligned}\tag{3.32}
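Unlike least squares, this cost has no closed-form minimiser, so it is optimised numerically. The sketch below is an illustration added here (synthetic two-class data, plain gradient descent, labels in {-1, +1} as in (3.32)); it is not the lecture's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Synthetic 2-D data: class +1 centred at (1, 1), class -1 centred at (-1, -1).
X_pos = rng.normal(loc=1.0, size=(n // 2, 2))
X_neg = rng.normal(loc=-1.0, size=(n // 2, 2))
X = np.column_stack([np.ones(n), np.vstack([X_pos, X_neg])])  # leading 1 for the offset
y = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])       # labels in {-1, +1}

def cost(theta):
    # Binary cross-entropy (3.32) for y in {-1, +1}: (1/n) sum ln(1 + exp(-y_i theta^T x_i)).
    return np.mean(np.log1p(np.exp(-y * (X @ theta))))

theta = np.zeros(3)
learning_rate = 0.1
for _ in range(500):
    margins = y * (X @ theta)
    # Gradient of the cost above with respect to theta.
    grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / n
    theta -= learning_rate * grad

print(theta, cost(theta))
```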

Multi-Class Logistic Regression

\text{softmax}({\bf{z}})\overset{\Delta}{=} \frac{1}{\sum_{m=1}^{M} e^{z_m}} \begin{bmatrix} e^{z_1} & e^{z_2} & \cdots & e^{z_M} \end{bmatrix}^T \tag{3.41}
{\bf{g}}({\bf{x}})=\text{softmax}({\bf{z}}), \quad \text{where } {\bf{z}}= \begin{bmatrix} \theta_1^T{\bf{x}}\\ \theta_2^T{\bf{x}}\\ \vdots\\ \theta_M^T{\bf{x}} \end{bmatrix} \tag{3.42}
g_m({\bf{x}})=\frac{e^{\theta^T_m {\bf{x}}}}{\sum_{j=1}^{M} e^{\theta^T_j {\bf{x}}}}, \quad m=1,\dots,M \tag{3.43}
J(\theta)=\frac{1}{n}\sum_{i=1}^{n} \underbrace{-\ln g_{y_i} ({\bf{x}}_i;\theta)}_{\text{Multi-class cross-entropy loss}}\tag{3.44}
J(\theta)=\frac{1}{n}\sum_{i=1}^{n}\Big(-\theta_{y_i}^T {\bf{x}}_i + \ln \sum_{j=1}^{M} e^{\theta_j^T{\bf{x}}_i}\Big)\tag{3.45}
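A minimal sketch of (3.41)–(3.44) for a hypothetical M = 3 class problem: it builds the class probabilities with the softmax and evaluates the multi-class cross-entropy cost for a given (randomly chosen) parameter matrix. Classes are labelled 1, …, M in the text but 0, …, M-1 in the code.

```python
import numpy as np

def softmax(Z):
    # (3.41); subtracting the row-wise max is an equivalent, overflow-safe formulation.
    E = np.exp(Z - np.max(Z, axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

M, p = 3, 2
rng = np.random.default_rng(3)
Theta = rng.normal(size=(M, p + 1))   # one parameter vector theta_m per class (rows)

X = np.column_stack([np.ones(4), rng.normal(size=(4, p))])  # 4 inputs, leading 1 for offset
y = np.array([0, 2, 1, 0])            # class labels, encoded 0..M-1

Z = X @ Theta.T                        # (3.42): z_m = theta_m^T x for every input and class
G = softmax(Z)                         # (3.43): g_m(x), rows sum to 1

# (3.44): average of -ln g_{y_i}(x_i) over the training data.
J = np.mean(-np.log(G[np.arange(len(y)), y]))
print(G, J)
```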
Figure 5 - Learning the car stopping distance with linear regression, second-order polynomial regression and 10th-order polynomial regression. From this, we can see that the 10th-order polynomial is overfitting to outliers in the data, making it less useful than even ordinary linear regression (blue).
- On the training set, we can show that the 10th-order polynomial achieves the lowest error, i.e. it is the most accurate of the three models on the training data.
- However, this is not necessarily a good thing: the 10th-order polynomial is overfitting to the data.
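This effect can be reproduced with a short sketch (synthetic data standing in for the car-stopping-distance set, which is not reproduced here): as the polynomial degree grows, the training error keeps shrinking while the error against held-out points from the underlying trend typically gets worse.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30

# Synthetic stand-in data: a roughly quadratic trend plus noise and a couple of outliers.
x = np.sort(rng.uniform(0, 10, size=n))
y = 0.5 * x ** 2 + 2 * x + rng.normal(scale=8.0, size=n)
y[[5, 20]] += 60.0

# Held-out points drawn from the noise-free trend, used to gauge generalisation.
x_test = np.linspace(0, 10, 100)
y_test = 0.5 * x_test ** 2 + 2 * x_test

for degree in (1, 2, 10):
    coeffs = np.polyfit(x, y, degree)   # least-squares polynomial fit of the given degree
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:8.1f}, test MSE {test_mse:10.1f}")
```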