COMP4702 Lecture 5

Principles of Parametric Modelling

$$\hat\theta=\arg\min_\theta\underbrace{\frac{1}{n}\sum_{i=1}^{n}\overbrace{\mathcal{L}(y_i,f_\theta(\mathbf{x}_i))}^{\text{loss function}}}_{\text{cost function } J(\theta)}\tag{5.4}$$
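As a minimal sketch (not from the lecture itself), the cost function in (5.4) can be written directly in NumPy for a linear model $f_\theta(\mathbf{x})=\theta^T\mathbf{x}$ with squared-error loss; the function name and the random data below are purely illustrative.

```python
import numpy as np

def cost(theta, X, y):
    """Cost J(theta) in (5.4): average squared-error loss for a linear model."""
    residuals = X @ theta - y          # f_theta(x_i) - y_i for every row of X
    return np.mean(residuals ** 2)     # (1/n) * sum of L(y_i, f_theta(x_i))

# Illustrative usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)
print(cost(theta_true, X, y))          # small, since theta_true generated y
```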

Loss Functions and Likelihood-Based Models


Figure 1 - Loss functions for regression, plotted as a function of the error
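As a hedged sketch of the kind of curves Figure 1 shows, common regression losses can be written as functions of the error $e=\hat y - y$; which losses appear in the figure is an assumption here, and the Huber threshold of 1 is an arbitrary choice.

```python
import numpy as np

def squared_loss(e):
    return e ** 2

def absolute_loss(e):
    return np.abs(e)

def huber_loss(e, delta=1.0):
    # Quadratic near zero, linear for |e| > delta
    return np.where(np.abs(e) <= delta,
                    0.5 * e ** 2,
                    delta * (np.abs(e) - 0.5 * delta))

errors = np.linspace(-3, 3, 7)
print(squared_loss(errors), absolute_loss(errors), huber_loss(errors), sep="\n")
```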

$$\hat{y}(\mathbf{x})=\text{sign}\lbrace f(\mathbf{x})\rbrace=\text{sign}\lbrace\theta^T\mathbf{x}\rbrace$$

Figure 2 - Loss Functions for Classification
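A minimal sketch of the hard class prediction $\hat{y}(\mathbf{x})=\text{sign}\{\theta^T\mathbf{x}\}$ above, together with classification losses written as functions of the margin $y\cdot f(\mathbf{x})$; which losses Figure 2 actually plots is an assumption, and logistic, hinge and misclassification losses are shown as typical examples.

```python
import numpy as np

def predict_class(theta, X):
    """Hard prediction y_hat = sign(theta^T x) for each row of X."""
    return np.sign(X @ theta)

def misclassification_loss(margin):        # margin = y * f(x)
    return (margin < 0).astype(float)

def logistic_loss(margin):
    return np.log(1.0 + np.exp(-margin))

def hinge_loss(margin):
    return np.maximum(0.0, 1.0 - margin)

margins = np.linspace(-2, 2, 9)
print(misclassification_loss(margins))
print(logistic_loss(margins))
print(hinge_loss(margins))
```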

Note: We leave the next two sections of the book as optional reading

Regularisation

Explicit Regularisation

$$\hat\theta=\arg\min_\theta\frac{1}{n}\|\mathbf{X}\theta-\mathbf{y}\|_2^2+\lambda\|\theta\|_2^2\tag{5.22}$$
$$\hat\theta=\arg\min_\theta\frac{1}{n}\|\mathbf{X}\theta-\mathbf{y}\|_2^2+\lambda\|\theta\|_1\tag{5.25}$$
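With the scaling used in (5.22), the L2-regularised (ridge) problem keeps a closed-form solution, $\hat\theta=(\mathbf{X}^T\mathbf{X}+n\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$, whereas the L1 (lasso) problem in (5.25) has no closed form and is normally solved numerically. A sketch of the ridge solution; the $\lambda$ value and data are arbitrary.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimiser of (1/n)||X theta - y||^2 + lam * ||theta||^2."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=50)
print(ridge_fit(X, y, lam=0.1))
```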

Implicit Regularisation

Parameter Optimisation

Examples of Objective Functions

Figure 3 - Examples of objective functions plotted in 3D

Figure 4 - Contour plots of functions shown in Figure 3

Gradient Descent


(A note on the optimisation of parameters from first principles)

$$\hat\theta=\arg\min_\theta\frac{1}{n}\|\mathbf{X}\theta-\mathbf{y}\|_2^2\tag{3.12}$$
$$\hat\theta=\arg\min_\theta J(\theta)\tag{5.28}$$
$$\nabla_\theta J(\theta)=-\frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{1+e^{y_i\theta^T\mathbf{x}_i}}\right)y_i\mathbf{x}_i\tag{5.29}$$
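A minimal NumPy sketch of the gradient in (5.29), assuming labels $y_i\in\{-1,+1\}$ and that the rows of `X` already include any constant feature; the synthetic data below is only for illustration.

```python
import numpy as np

def logistic_cost_gradient(theta, X, y):
    """Gradient (5.29): -(1/n) * sum_i y_i * x_i / (1 + exp(y_i * theta^T x_i))."""
    margins = y * (X @ theta)                    # y_i * theta^T x_i for each data point
    weights = 1.0 / (1.0 + np.exp(margins))      # scalar weight per data point
    return -(X * (weights * y)[:, None]).mean(axis=0)

# Illustrative usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.sign(X @ np.array([1.0, -1.0, 0.5]))      # labels in {-1, +1}
print(logistic_cost_gradient(np.zeros(3), X, y))
```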

Gradient Descent Algorithm

  1. Set $t\leftarrow 0$
  2. While $\|\theta^{(t)}-\theta^{(t-1)}\|$ not small enough, do
  3.     Update $\theta^{(t+1)}\leftarrow\theta^{(t)}-\gamma\nabla_\theta J(\theta^{(t)})$
  4.     Update $t\leftarrow t+1$
  5. end
  6. return $\hat\theta\leftarrow\theta^{(t-1)}$
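A sketch of the algorithm above in Python: `grad_J` stands for $\nabla_\theta J$, and the learning rate $\gamma$, stopping tolerance, and iteration cap are illustrative choices rather than values from the lecture.

```python
import numpy as np

def gradient_descent(grad_J, theta0, gamma=0.1, tol=1e-6, max_iter=10_000):
    """Gradient descent as in the algorithm above: iterate until the step is small."""
    theta_prev = theta0
    theta = theta0 - gamma * grad_J(theta0)       # first update so the stopping test is defined
    t = 1
    while np.linalg.norm(theta - theta_prev) > tol and t < max_iter:
        theta_prev = theta
        theta = theta - gamma * grad_J(theta)     # theta^(t+1) = theta^(t) - gamma * grad J(theta^(t))
        t += 1
    return theta

# Illustrative usage: minimise J(theta) = ||theta - 1||^2, whose gradient is 2(theta - 1)
print(gradient_descent(lambda th: 2 * (th - np.ones(3)), theta0=np.zeros(3)))
```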

Figure 5 - Effect of learning rate on finding global minima


Gradient Descent Example

Example 5.5

Figure 6 - Convex function

Figure 7 - Non-Convex function
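To make the contrast concrete, a hypothetical 1D illustration (not the functions shown in Figures 6 and 7): on a convex function, gradient descent reaches the same minimum from any starting point, while on a non-convex function different starting points can end in different local minima. The functions, learning rate and starting points below are arbitrary choices.

```python
def descend(grad, x0, gamma=0.05, steps=500):
    """Plain gradient descent on a 1D function with gradient `grad`."""
    x = x0
    for _ in range(steps):
        x = x - gamma * grad(x)
    return x

# Convex: J(x) = x^2 has a single minimum at 0, found from either start
print(descend(lambda x: 2 * x, x0=-3.0), descend(lambda x: 2 * x, x0=4.0))

# Non-convex: J(x) = x^4 - 3x^2 + x has two local minima; the result depends on the start
grad_nc = lambda x: 4 * x**3 - 6 * x + 1
print(descend(grad_nc, x0=-2.0), descend(grad_nc, x0=2.0))
```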

Other Methods

Early Stopping with Logistic Regression

An example of regularisation by early stopping

$$\theta_0+\theta_1 x_1+\theta_2 x_2+\theta_3 x_1^2+\theta_4 x_1 x_2+\theta_5 x_2^2+\cdots+\theta_{229}x_1 x_2^{19}+\theta_{230}x_2^{20}$$
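A sketch of building such a polynomial feature expansion of two inputs; the degree of 20 (giving the 231 terms $\theta_0,\ldots,\theta_{230}$) is inferred from the indices above, and the function name is illustrative.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features(X, degree=20):
    """All monomials x1^a * x2^b with a + b <= degree, including the constant term."""
    n = X.shape[0]
    columns = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(range(X.shape[1]), d):
            col = np.ones(n)
            for j in combo:
                col = col * X[:, j]
            columns.append(col)
    return np.column_stack(columns)

X = np.random.default_rng(0).normal(size=(5, 2))
print(poly_features(X).shape)   # (5, 231) for two inputs and degree 20
```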

Figure 8 - Hold-out error
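A sketch of regularisation by early stopping in the same gradient-descent setup: train on one split, track the error on a hold-out split (as in Figure 8), and keep the parameters from the iteration with the lowest hold-out error. The squared-error cost, split sizes and learning rate below are illustrative assumptions, not the lecture's example.

```python
import numpy as np

def early_stopping_fit(grad_J, cost, theta0, train, holdout, gamma=0.1, max_iter=500):
    """Gradient descent on the training split; return the iterate with lowest hold-out error."""
    theta = theta0
    best_theta, best_err = theta0, cost(theta0, *holdout)
    for _ in range(max_iter):
        theta = theta - gamma * grad_J(theta, *train)
        err = cost(theta, *holdout)                  # hold-out error at this iteration
        if err < best_err:
            best_theta, best_err = theta, err
    return best_theta, best_err

# Illustrative usage with a squared-error cost and its gradient
cost = lambda th, X, y: np.mean((X @ th - y) ** 2)
grad = lambda th, X, y: 2 * X.T @ (X @ th - y) / len(y)
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + rng.normal(size=80)
theta_hat, err = early_stopping_fit(grad, cost, np.zeros(4), (X[:60], y[:60]), (X[60:], y[60:]))
print(theta_hat, err)
```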

Optimisation with Large Datasets

$$\theta^{(t+1)}=\theta^{(t)}-\gamma\frac{1}{n/2}\sum_{i=1}^{n/2}\nabla_\theta\mathcal{L}(\mathbf{x}_i,y_i,\theta^{(t)})\tag{5.37a}$$
$$\theta^{(t+2)}=\theta^{(t+1)}-\gamma\frac{1}{n/2}\sum_{i=n/2+1}^{n}\nabla_\theta\mathcal{L}(\mathbf{x}_i,y_i,\theta^{(t+1)})\tag{5.37b}$$
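Equations (5.37) split the data into two halves and alternate between them; a sketch generalising this idea to mini-batches of arbitrary size, with the data reshuffled each epoch. The batch size, learning rate and epoch count are arbitrary choices, and the squared-error gradient is only an illustrative plug-in.

```python
import numpy as np

def minibatch_sgd(grad_L, theta0, X, y, gamma=0.05, batch_size=32, epochs=20, seed=0):
    """Stochastic gradient descent: each update averages the gradient over one mini-batch."""
    rng = np.random.default_rng(seed)
    theta, n = theta0, len(y)
    for _ in range(epochs):
        order = rng.permutation(n)                   # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            theta = theta - gamma * grad_L(theta, X[idx], y[idx])
    return theta

# Illustrative usage with the squared-error loss gradient
grad = lambda th, Xb, yb: 2 * Xb.T @ (Xb @ th - yb) / len(yb)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=200)
print(minibatch_sgd(grad, np.zeros(3), X, y))
```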

Hyperparameter Optimisation