Lindholm et al., Chapter 5

Learning Parametric Models

Principles of Parametric Modelling

$$y=f_\theta({\bf x})+\varepsilon\tag{5.1}$$

$$g({\bf x}) = \frac{e^{f_\theta({\bf x})}}{1+e^{f_\theta({\bf x})}}\tag{5.3}$$

$$\hat\theta=\arg\min_\theta\underbrace{\frac 1n \sum_{i=1}^{n} \overbrace{L(y_i, f_\theta({\bf x}_i))}^{\text{loss function}}}_{\text{cost function } J(\theta)}$$
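
As a concrete instance of the recipe above, here is a minimal numpy sketch (my own illustration, not code from the book) that evaluates the cost $J(\theta)$ for a linear model $f_\theta({\bf x}) = \theta^T {\bf x}$ with squared-error loss; the function names `model` and `cost` are arbitrary.

```python
import numpy as np

def model(theta, X):
    """Linear parametric model f_theta(x) = theta^T x, applied row-wise to X (n x p)."""
    return X @ theta

def cost(theta, X, y):
    """Cost J(theta): the average loss over the n training points (squared-error loss here)."""
    residuals = model(theta, X) - y
    return np.mean(residuals ** 2)

# Tiny synthetic example: n = 4 data points, p = 2 parameters (intercept + slope).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.1, 1.9, 4.1, 5.9])
print(cost(np.array([0.0, 2.0]), X, y))  # cost of one candidate parameter vector
```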

Loss Functions and Likelihood-Based Models

Loss Functions for Regression

$$L(y, \hat y) = (\hat y - y)^2 \tag{5.6}$$

$$L(y,\hat y)=|\hat y - y| \tag{5.7}$$

$$L(y,\hat y) = \begin{cases} \frac 12 (\hat y - y)^2 & \text{if } |\hat y - y| < 1, \\ |\hat y - y| - \frac12 & \text{otherwise.} \end{cases} \tag{5.8}$$

$$L(y,\hat y) = \begin{cases} 0 & \text{if } |\hat y - y| < \epsilon, \\ |\hat y - y| - \epsilon & \text{otherwise.} \end{cases} \tag{5.9}$$
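
These are the squared-error (5.6), absolute-error (5.7), Huber (5.8), and $\epsilon$-insensitive (5.9) losses. Below is a small numpy sketch of them for reference; it is my own illustration, and the default `eps=0.5` is an arbitrary choice.

```python
import numpy as np

def squared_loss(y, y_hat):                       # (5.6)
    return (y_hat - y) ** 2

def absolute_loss(y, y_hat):                      # (5.7)
    return np.abs(y_hat - y)

def huber_loss(y, y_hat):                         # (5.8): quadratic near zero, linear in the tails
    r = np.abs(y_hat - y)
    return np.where(r < 1.0, 0.5 * r ** 2, r - 0.5)

def epsilon_insensitive_loss(y, y_hat, eps=0.5):  # (5.9): zero inside a tube of half-width eps
    r = np.abs(y_hat - y)
    return np.where(r < eps, 0.0, r - eps)

errors = np.linspace(-3.0, 3.0, 7)
print(huber_loss(0.0, errors))
```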

Loss Functions for Binary Classification

$$L(y,\hat y)=\mathbb{I}\lbrace\hat y \ne y\rbrace =\begin{cases}0&\text{if }\hat y = y \\ 1 & \text{if }\hat y \ne y\end{cases}\tag{5.10}$$
$$L(y, g({\bf x})) = \begin{cases}-\ln g({\bf x}) & \text{if } y=1 \\ -\ln (1 - g({\bf x})) & \text{if } y=-1\end{cases}\tag{5.11}$$
$$\hat y ({\bf x}) = \text{sign}\lbrace f ({\bf x}) \rbrace \tag{5.12}$$

$$L(y \cdot f({\bf x})) = \ln (1 + \exp (-y \cdot f({\bf x}))) \tag{5.13}$$

$$L(y \cdot f({\bf x})) = \begin{cases}1&\text{if } y \cdot f({\bf x}) < 0, \\ 0 & \text{otherwise}\end{cases} \tag{5.14}$$
Squared hinge loss:
$$L(y\cdot f({\bf x})) = \begin{cases}(1-y\cdot f({\bf x}))^2 &\text{for } y \cdot f({\bf x})\le 1\\0&\text{otherwise}\end{cases}\tag{5.17}$$

Huberised squared hinge loss:
$$L(y\cdot f({\bf x})) = \begin{cases} -4y\cdot f({\bf x}) &\text{for } y \cdot f({\bf x}) \le -1 \\ (1 - y \cdot f({\bf x}))^2 & \text{for } -1 \le y \cdot f({\bf x}) \le 1 \\ 0 & \text{otherwise} \end{cases}\tag{5.18}$$
$$L(y,\hat y)=\begin{cases} 0 & \text{if } \hat y = y \\ 1 & \text{if } \hat y \ne y \text{ and } y = -1 \\ C & \text{if } \hat y \ne y \text{ and } y = 1 \end{cases}\tag{5.19}$$
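
Writing $m = y \cdot f({\bf x})$ for the margin, the losses (5.13), (5.14), (5.17), and (5.18) can be sketched in a few lines of numpy (my own illustration, not the book's code):

```python
import numpy as np

def logistic_loss(m):                 # (5.13)
    return np.logaddexp(0.0, -m)      # = ln(1 + exp(-m)), numerically stable

def misclassification_loss(m):        # (5.14)
    return np.where(m < 0.0, 1.0, 0.0)

def squared_hinge_loss(m):            # (5.17)
    return np.where(m <= 1.0, (1.0 - m) ** 2, 0.0)

def huberised_squared_hinge_loss(m):  # (5.18): linear for m <= -1, quadratic on [-1, 1], zero above 1
    return np.where(m <= -1.0, -4.0 * m,
           np.where(m <= 1.0, (1.0 - m) ** 2, 0.0))

margins = np.linspace(-2.0, 2.0, 9)
print(logistic_loss(margins))  # a smooth, convex surrogate for the misclassification loss
```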

Multi-Class Classification

Likelihood-Based Models and the Maximum Likelihood Approach

$$J(\theta) = -\frac 1n \sum_{i=1}^{n} \ln p(y_i \,|\, {\bf x}_i ; \theta)$$
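
For example (my own compressed version of a connection made in the book), if the likelihood is Gaussian with fixed variance, $p(y \,|\, {\bf x}; \theta) = \mathcal N(y; f_\theta({\bf x}), \sigma^2)$, then

$$-\frac 1n \sum_{i=1}^{n} \ln \mathcal N\left(y_i;\, f_\theta({\bf x}_i), \sigma^2\right) = \frac{1}{2\sigma^2} \cdot \frac 1n \sum_{i=1}^{n} \left(y_i - f_\theta({\bf x}_i)\right)^2 + \frac 12 \ln\left(2\pi\sigma^2\right),$$

so maximising the likelihood is equivalent to minimising the squared-error cost (5.6).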
I have ignored a bunch of content here for the sake of time

Regularisation

L2 Regularisation

Also known as Tikhonov regularisation, weight decay, ridge regression, or shrinkage

$$\hat\theta=\arg\min_\theta\frac 1n || {\bf X}\theta-{\bf y}||_2^2 + \lambda ||\theta||_2^2 \tag{5.22}$$

$$({\bf X}^T{\bf X} + n \lambda {\bf I}_{p+1})\hat\theta = {\bf X}^T {\bf y}\tag{5.23}$$

$$\hat\theta=({\bf X}^T {\bf X} + n \lambda {\bf I}_{p+1})^{-1} {\bf X}^T {\bf y} \tag{5.24}$$
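
A minimal numpy sketch of the ridge estimate (my own illustration): rather than forming the inverse in (5.24) explicitly, it solves the linear system (5.23), which is the numerically preferred route.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + n*lam*I) theta = X^T y, i.e. the regularised normal equations (5.23)."""
    n, d = X.shape  # d = p + 1 if X already contains a constant (intercept) column
    A = X.T @ X + n * lam * np.eye(d)
    return np.linalg.solve(A, X.T @ y)

# Small example: intercept column plus one feature.
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([0.2, 1.1, 1.9, 3.2, 3.9])
print(ridge_fit(X, y, lam=0.1))
```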

L1 Regularisation

$$\hat\theta=\arg\min_\theta\frac 1n || {\bf X}\theta - {\bf y}||_2^2 + \lambda ||\theta||_1 \tag{5.25}$$
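
Unlike (5.24), the L1-regularised problem (5.25) has no closed-form solution. One common way to solve it is proximal gradient descent (ISTA) with soft-thresholding; the sketch below is my own choice of algorithm and step size, not something prescribed by the book.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1: shrink every coordinate towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=1000):
    """Minimise (1/n)*||X theta - y||_2^2 + lam*||theta||_1, as in (5.25)."""
    n, d = X.shape
    theta = np.zeros(d)
    step = 1.0 / ((2.0 / n) * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the smooth part
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ theta - y)          # gradient of the squared-error term
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), np.arange(20.0), rng.standard_normal(20)])
y = 1.0 + 0.5 * np.arange(20.0)   # only the first two columns are informative
print(lasso_ista(X, y, lam=0.5))  # the last coefficient is typically shrunk to exactly zero
```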

Implicit Regularisation

Parameter Optimisation

Gradient Descent

Gradient descent is useful when there is no closed-form solution for the minimiser, but we can evaluate the objective function and its gradient.

For a sufficiently small learning rate $\gamma > 0$, a step in the negative gradient direction does not increase the objective,
$$J(\theta-\gamma\nabla_\theta J(\theta)) \le J(\theta), \tag{5.30}$$
which motivates the iterative update
$$\theta^{(t+1)} = \theta^{(t)} - \gamma\nabla_\theta J(\theta^{(t)}). \tag{5.31}$$

Input: Objective function $J(\theta)$, initial $\theta^{(0)}$, learning rate $\gamma$

Result: $\hat\theta$

  1. Set $t=0$
  2. while $||\theta^{(t)} - \theta^{(t-1)}||$ not small enough do
  3.  |   Update $\theta^{(t+1)}\leftarrow \theta^{(t)} -\gamma\nabla_\theta J(\theta^{(t)})$
  4.  |   Update $t\leftarrow t+1$
  5. end
  6. return $\hat\theta\leftarrow\theta^{(t)}$

Figure 5.7 - Effect of the learning rate on gradient descent. (Left) Learning rate too low: slow convergence. (Middle) Learning rate too high: the iterates overshoot and oscillate. (Right) Well-chosen learning rate.
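
A compact Python version of the pseudocode above (my own sketch; the quadratic test objective, the tolerance, and the iteration cap are illustrative choices):

```python
import numpy as np

def gradient_descent(grad_J, theta0, gamma, tol=1e-6, max_iter=10_000):
    """Repeat theta <- theta - gamma * grad_J(theta) until successive iterates barely move."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        theta_next = theta - gamma * grad_J(theta)    # the update (5.31)
        if np.linalg.norm(theta_next - theta) < tol:  # ||theta^(t+1) - theta^(t)|| small enough
            return theta_next
        theta = theta_next
    return theta

# Example: J(theta) = ||theta - c||^2 has gradient 2*(theta - c) and minimiser theta = c.
c = np.array([1.0, -2.0])
print(gradient_descent(lambda th: 2.0 * (th - c), theta0=np.zeros(2), gamma=0.1))
```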

Second-Order Gradients

Omitted as not really covered in the lectures -> not relevant

Optimisation with Large Datasets (SGD)

Instead of computing the gradient over all $n$ training data points in every iteration, we can use a different subset (mini-batch) in each update. For example, splitting the data into two halves gives the pair of updates
$$\theta^{(t+1)} = \theta^{(t)} - \gamma \frac{1}{n/2} \sum_{i=1}^{n/2} \nabla_\theta L({\bf x}_i, y_i, \theta^{(t)}) \tag{5.37a}$$

$$\theta^{(t+2)} = \theta^{(t+1)} - \gamma \frac{1}{n/2} \sum_{i=n/2+1}^{n} \nabla_\theta L({\bf x}_i, y_i, \theta^{(t+1)}) \tag{5.37b}$$

Stochastic Gradient Descent Pseudocode

Input: Objective function $J(\theta)=\frac 1n \sum_{i=1}^n L({\bf x}_i, y_i, \theta)$, initial $\theta^{(0)}$, learning rate $\gamma$

Result: $\hat\theta$

  1. Set $t\leftarrow 0$
  2. while convergence criterion not met do
  3.  |   for $i\in\{1,2,\dots,E\}$ do
  4.  |   |   Randomly shuffle the training data $\lbrace ({\bf x}_i, y_i) \rbrace_{i=1}^n$
  5.  |   |   for $j\in\{1,2,\dots,\frac{n}{n_b}\}$ do
  6.  |   |   |   Approximate the gradient using the mini-batch $\lbrace({\bf x}_i, y_i)\rbrace_{i=(j-1) n_b + 1}^{jn_b}$: $\hat{\bf d}^{(t)} = \frac{1}{n_b}\sum_{i=(j-1)n_b + 1}^{jn_b} \nabla_\theta L({\bf x}_i, y_i, \theta^{(t)})$
  7.  |   |   |   Update $\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \hat{\bf d}^{(t)}$
  8.  |   |   |   Update $t\leftarrow t + 1$
  9.  |   |   end
  10.  |   end
  11. end
  12. return $\hat\theta\leftarrow\theta^{(t-1)}$
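
And a corresponding mini-batch SGD sketch in Python (my own illustration). It assumes the same least-squares setting as the earlier examples; the batch size, number of epochs, and the constant learning rate are arbitrary illustrative choices.

```python
import numpy as np

def sgd(X, y, gamma=0.01, n_epochs=300, batch_size=2, seed=0):
    """Mini-batch SGD for the cost (1/n) * sum_i (x_i^T theta - y_i)^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_epochs):                        # one pass over the data per epoch
        perm = rng.permutation(n)                    # randomly shuffle the training data
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]     # indices of the current mini-batch
            Xb, yb = X[idx], y[idx]
            grad_hat = (2.0 / len(idx)) * Xb.T @ (Xb @ theta - yb)  # mini-batch gradient estimate
            theta = theta - gamma * grad_hat         # the SGD update
    return theta

X = np.column_stack([np.ones(8), np.arange(8.0)])
y = 1.0 + 2.0 * np.arange(8.0)
print(sgd(X, y))  # approaches the least-squares solution [1, 2]
```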

Learning Rate and Convergence for SGD


There was other content here, but I have decided to omit it because it's taking too much time

Hyperparameter Optimisation