Lindholm Chapter 8

Non-Linear Input Transformations and Kernels

Creating Features by Non-Linear Input Transformations

$$y=\theta_0 + \theta_1 x + \varepsilon \tag{8.1}$$
$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_{d-1} x^{d-1} + \varepsilon = \theta^T \phi(x) + \varepsilon \tag{8.2}$$
$$\text{replace the original } {\bf X} = \underbrace{\begin{bmatrix}{\bf x}_1^T \\ {\bf x}_2^T \\ \vdots \\ {\bf x}_n^T \end{bmatrix}}_{n\times (p+1)} \text{ with the transformed } {\bf\Phi}({\bf X}) = \underbrace{\begin{bmatrix}\phi({\bf x}_1)^T \\ \phi({\bf x}_2)^T \\ \vdots \\ \phi({\bf x}_n)^T \end{bmatrix}}_{n\times d}$$
$$\phi(x) = \begin{bmatrix}1 & \sin(x) & \cos(x) & \sin(2x) & \cos(2x) & \cdots \end{bmatrix}^T$$
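
As a concrete sketch of (8.2): fitting the model is just ordinary least squares on the transformed inputs. This assumes NumPy; the sine data and the polynomial feature map are illustrative choices, not from the book.

```python
import numpy as np

def phi(x, d=4):
    # Polynomial feature map phi(x) = [1, x, x^2, ..., x^(d-1)]^T, cf. (8.2)
    return np.vander(x, N=d, increasing=True)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)                        # illustrative 1-D inputs
y = np.sin(3 * x) + 0.1 * rng.standard_normal(30)      # illustrative outputs

Phi = phi(x)                                           # Phi(X), replaces the original X
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # least-squares estimate of theta

x_star = np.array([0.5])
y_hat = phi(x_star) @ theta_hat                        # prediction theta_hat^T phi(x_star)
```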

Kernel Ridge Regression

- Ridge Regression = Linear Regression + L2 Regularisation
- Kernel Ridge Regression = Ridge Regression using Kernels
- Using kernels allows L2-regularised Linear Regression to use non-linear input transformations

Re-Formulating Linear Regression

$$\hat\theta=\arg\min_\theta \frac 1n \sum_{i=1}^{n} (\underbrace{\theta^T \phi({\bf x}_i)}_{\hat y({\bf x}_i)} - y_i)^2 + \lambda || \theta ||_2^2 = ({\bf\Phi}({\bf X})^T {\bf \Phi}({\bf X}) + n \lambda {\bf I})^{-1} {\bf \Phi}({\bf X})^T {\bf y} \tag{8.4a}$$
$$\hat y({\bf x_\star}) = \hat\theta^T \phi({\bf x_\star}) \tag{8.5}$$
$$\begin{align*} \hat y({\bf x_\star}) = \underbrace{\hat\theta^T}_{1\times d} \ \underbrace{\phi({\bf x_\star})}_{d\times 1} &= \left(({\bf\Phi}({\bf X})^T {\bf\Phi}({\bf X}) + n\lambda {\bf I})^{-1} {\bf\Phi}({\bf X})^T {\bf y}\right)^T \phi({\bf x_\star}) \\ &=\underbrace{{\bf y}^T}_{1\times n} \underbrace{ \underbrace{{\bf\Phi}({\bf X})}_{n\times d} \underbrace{({\bf\Phi}({\bf X})^T {\bf\Phi}({\bf X}) + n\lambda {\bf I})^{-1}}_{d\times d} \underbrace{\phi({\bf x_\star})}_{d\times 1} }_{n\times 1} \tag{8.6} \end{align*}$$
$$\hat y({\bf x_\star}) = \underbrace{{\bf y}^T}_{1\times n} \ \underbrace{({\bf\Phi}({\bf X}) {\bf\Phi}({\bf X})^T + n \lambda {\bf I})^{-1}}_{n\times n} \ \underbrace{{\bf\Phi}({\bf X}) \phi({\bf x_\star})}_{n\times 1} \tag{8.7}$$
$${\bf\Phi}({\bf X}) {\bf\Phi}({\bf X})^T = \begin{bmatrix} \phi({\bf x}_1)^T \phi({\bf x}_1) & \phi({\bf x}_1)^T \phi({\bf x}_2) & \cdots & \phi({\bf x}_1)^T \phi({\bf x}_n) \\ \phi({\bf x}_2)^T \phi({\bf x}_1) & \phi({\bf x}_2)^T \phi({\bf x}_2) & \cdots & \phi({\bf x}_2)^T \phi({\bf x}_n) \\ \vdots & & \ddots & \vdots \\ \phi({\bf x}_n)^T \phi({\bf x}_1) & \phi({\bf x}_n)^T \phi({\bf x}_2) & \cdots & \phi({\bf x}_n)^T \phi({\bf x}_n) \end{bmatrix} \tag{8.8}$$
$${\bf\Phi}({\bf X})\phi({\bf x_\star})=\begin{bmatrix} \phi({\bf x}_1)^T \phi({\bf x_\star}) \\ \phi({\bf x}_2)^T \phi({\bf x_\star}) \\ \vdots \\ \phi({\bf x}_n)^T \phi({\bf x_\star}) \end{bmatrix} \tag{8.9}$$
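
A quick numerical sanity check (NumPy assumed; the random $\Phi$ and $\bf y$ are purely illustrative) that the primal form (8.6) and the pushed-through form (8.7) give the same prediction:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 20, 5, 0.1
Phi = rng.standard_normal((n, d))     # stand-in for Phi(X)
y = rng.standard_normal(n)
phi_star = rng.standard_normal(d)     # stand-in for phi(x_star)

# (8.6): y^T Phi (Phi^T Phi + n*lam*I)^{-1} phi(x_star)   -- a d x d system
pred_primal = y @ Phi @ np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(d), phi_star)

# (8.7): y^T (Phi Phi^T + n*lam*I)^{-1} Phi phi(x_star)   -- an n x n system
pred_dual = y @ np.linalg.solve(Phi @ Phi.T + n * lam * np.eye(n), Phi @ phi_star)

print(np.isclose(pred_primal, pred_dual))   # True: both forms give the same prediction
```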

$$\phi(x)^T \phi(x') = \begin{bmatrix} 1 & \sqrt 3\, x & \sqrt 3\, x^2 & x^3 \end{bmatrix} \begin{bmatrix} 1 \\ \sqrt 3\, x' \\ \sqrt 3\, x'^2 \\ x'^3 \end{bmatrix} = 1 + 3xx' + 3x^2 x'^2 + x^3 x'^3 = (1 + xx')^3 \tag{8.10}$$
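
A one-line numerical check of (8.10) (NumPy assumed; the two test points are arbitrary):

```python
import numpy as np

x, xp = 0.7, -1.3                                              # arbitrary scalar inputs
phi = lambda t: np.array([1.0, np.sqrt(3) * t, np.sqrt(3) * t**2, t**3])
print(np.isclose(phi(x) @ phi(xp), (1 + x * xp) ** 3))         # True
```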

If we choose $\phi({\bf x})$ such that the inner product $\phi({\bf x})^T \phi({\bf x}')$ can be computed without first computing $\phi({\bf x})$, we can let $d$ be arbitrarily large.

Kernel

$$\kappa({\bf x}, {\bf x'})=\phi({\bf x})^T \phi({\bf x'}) \tag{8.11}$$

If $\bf x$ enters the model only through inner products $\phi({\bf x})^T\phi({\bf x}')$, we can choose a kernel $\kappa({\bf x}, {\bf x'})$ directly instead of choosing $\phi({\bf x})$.

$$\begin{align*} \hat y({\bf x_\star}) &= \underbrace{{\bf y}^T}_{1\times n} \ \underbrace{({\bf K}({\bf X},{\bf X}) + n\lambda {\bf I})^{-1}}_{n\times n} \ \underbrace{{\bf K}({\bf X}, {\bf x_\star})}_{n\times 1} \tag{8.12a}\\ \text{where } {\bf K}({\bf X},{\bf X})&=\begin{bmatrix} \kappa({\bf x}_1, {\bf x}_1) & \kappa({\bf x}_1, {\bf x}_2) & \cdots & \kappa({\bf x}_1, {\bf x}_n)\\ \kappa({\bf x}_2, {\bf x}_1) & \kappa({\bf x}_2, {\bf x}_2) & \cdots & \kappa({\bf x}_2, {\bf x}_n)\\ \vdots & & \ddots & \vdots \\ \kappa({\bf x}_n, {\bf x}_1) & \kappa({\bf x}_n, {\bf x}_2) & \cdots & \kappa({\bf x}_n, {\bf x}_n) \end{bmatrix}\tag{8.12b}\\ {\bf K}({\bf X},{\bf x_\star}) &= \begin{bmatrix} \kappa({\bf x}_1, {\bf x}_\star) \\ \kappa({\bf x}_2, {\bf x}_\star) \\ \vdots \\ \kappa({\bf x}_n, {\bf x}_\star) \end{bmatrix}\tag{8.12c} \end{align*}$$
Squared exponential kernel:
$$\kappa({\bf x}, {\bf x'})=\exp\left(-\frac{||{\bf x-x'}||_2^2}{2\ell^2}\right) \tag{8.13}$$
Polynomial kernel:
$$\kappa({\bf x},{\bf x'}) = (c + {\bf x}^T {\bf x}')^{d-1}$$
Linear kernel:
$$\kappa({\bf x},{\bf x'}) = c + {\bf x}^T {\bf x}'$$
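
A minimal sketch of these kernels as plain functions (NumPy assumed; the function names and default hyperparameters are my own choices):

```python
import numpy as np

def squared_exponential_kernel(x, x_prime, ell=1.0):
    # Squared exponential kernel, Equation (8.13)
    return np.exp(-np.sum((x - x_prime) ** 2) / (2 * ell ** 2))

def polynomial_kernel(x, x_prime, c=1.0, degree=3):
    # Polynomial kernel (c + x^T x')^(d-1); here `degree` plays the role of d-1
    return (c + np.dot(x, x_prime)) ** degree

def linear_kernel(x, x_prime, c=0.0):
    # Linear kernel c + x^T x'
    return c + np.dot(x, x_prime)
```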

Learn Kernel Ridge Regression
Data: Training data $\mathcal{T}=\{{\bf x}_i, y_i\}_{i=1}^{n}$ and kernel $\kappa$
Result: Learned dual parameters $\hat\alpha$

  1. Compute $\hat\alpha$ as per Equation 8.14a

Predict with Kernel Ridge Regression
Data: Learned dual parameters $\hat\alpha$ and test input ${\bf x}_\star$
Result: Prediction $\hat y({\bf x_\star})$

  1. Compute $\hat y({\bf x_\star})$ as per Equation 8.14b
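
Equations (8.14a) and (8.14b) are not reproduced in these notes; reading them off (8.12a), the learn and predict steps presumably amount to

$$\hat\alpha = ({\bf K}({\bf X},{\bf X}) + n\lambda {\bf I})^{-1}{\bf y} \qquad \text{and} \qquad \hat y({\bf x_\star}) = \hat\alpha^T {\bf K}({\bf X},{\bf x_\star})$$

A minimal NumPy sketch of the two methods under that assumption (function names are my own):

```python
import numpy as np

def fit_kernel_ridge(X, y, kernel, lam):
    # Learn: alpha_hat = (K(X, X) + n * lam * I)^{-1} y
    n = len(X)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict_kernel_ridge(X, alpha_hat, kernel, x_star):
    # Predict: y_hat(x_star) = alpha_hat^T K(X, x_star)
    k_star = np.array([kernel(a, x_star) for a in X])
    return alpha_hat @ k_star
```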

Support Vector Regression

A version of Kernel Ridge Regression, but with a different loss function

Representer Theorem

$$\hat y ({\bf x_\star})=\hat\theta^T \phi({\bf x_\star}) = \hat\alpha^T \underbrace{{\bf \Phi}({\bf X}) \phi({\bf x_\star})}_{{\bf K}({\bf X},{\bf x_\star})}\tag{8.15}$$
$$\hat\theta={\bf \Phi}({\bf X})^T \hat\alpha \tag{8.16}$$

Support Vector Regression

The same as Kernel Ridge Regression, but using the $\epsilon$-insensitive loss

$$L(y,\hat y) = \begin{cases}0 & \text{if } |y-\hat y| < \epsilon, \\ |y - \hat y| - \epsilon & \text{otherwise}\end{cases} \tag{8.17}$$
$$\hat y({\bf x_\star}) = \theta^T \phi({\bf x_\star})\tag{8.18a}$$
$$\hat\theta=\arg\min_\theta \frac 1n \sum_{i=1}^{n} \max \{0, | y_i - \underbrace{\theta^T \phi({\bf x}_i)}_{\hat y({\bf x}_i)}| - \epsilon\} + \lambda || \theta ||_2^2 \tag{8.18b}$$
$$\hat y ({\bf x_\star}) = \hat\alpha^T {\bf K}({\bf X}, {\bf x_\star}) \tag{8.19a}$$
$$\hat\alpha = \arg\min_\alpha \frac 12 \alpha^T {\bf K}({\bf X}, {\bf X})\alpha-\alpha^T{\bf y} + \epsilon ||\alpha||_1\tag{8.19b}$$
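
Since (8.19b) is a convex problem (a quadratic plus an L1 penalty), one way to sketch the learning step is with a generic convex solver. This assumes cvxpy is available; the Cholesky jitter and the function name are my own choices, not the book's algorithm:

```python
import numpy as np
import cvxpy as cp

def fit_svr_dual(K, y, eps):
    # Sketch of (8.19b): min_alpha 0.5 * alpha^T K alpha - alpha^T y + eps * ||alpha||_1
    # K is the kernel Gram matrix K(X, X); a small jitter keeps the Cholesky factor stable
    n = len(y)
    L = np.linalg.cholesky(K + 1e-10 * np.eye(n))        # K ~= L L^T
    alpha = cp.Variable(n)
    objective = 0.5 * cp.sum_squares(L.T @ alpha) - alpha @ y + eps * cp.norm1(alpha)
    cp.Problem(cp.Minimize(objective)).solve()
    return alpha.value

# Prediction then follows (8.19a): y_hat(x_star) = alpha_hat^T K(X, x_star)
```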

Support Vector Classification

$$\hat y({\bf x_\star}) = \text{sign} \{\theta^T \phi({\bf x_\star})\} \tag{8.32}$$
$$L({\bf x}, y, \theta) = \max\{0, 1-y \theta^T \phi({\bf x})\} = \begin{cases} 1 - y\theta^T \phi({\bf x}) & \text{if } y\theta^T\phi({\bf x}) < 1 \\ 0 & \text{otherwise} \end{cases}\tag{8.33}$$
$$\hat\theta = \arg\min_\theta \frac 1n \sum_{i=1}^{n} \max\{0, 1-y_i \theta^T \phi({\bf x}_i)\} + \lambda ||\theta||_2^2 \tag{8.34}$$
$$\begin{align*} \hat\alpha&=\arg\min_\alpha \frac 12 \alpha^T {\bf K}({\bf X},{\bf X})\alpha-\alpha^T {\bf y}\tag{8.35a}\\ \text{subject to } |\alpha_i| &\le \frac{1}{2n\lambda} \text{ and } 0 \le \alpha_i y_i \tag{8.35b}\\ \text{with } \hat y({\bf x_\star}) &= \text{sign}\left(\hat\alpha^T {\bf K}({\bf X}, {\bf x_\star})\right)\tag{8.35c} \end{align*}$$
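
The constrained dual problem (8.35) can be sketched the same way with a generic convex solver (again assuming cvxpy; the names and small jitter are my own choices):

```python
import numpy as np
import cvxpy as cp

def fit_svc_dual(K, y, lam):
    # Sketch of (8.35a)-(8.35b): min_alpha 0.5 * alpha^T K alpha - alpha^T y
    # subject to |alpha_i| <= 1/(2 n lam) and 0 <= alpha_i y_i
    # K is the Gram matrix K(X, X); y has entries in {-1, +1}
    n = len(y)
    L = np.linalg.cholesky(K + 1e-10 * np.eye(n))        # K ~= L L^T
    alpha = cp.Variable(n)
    objective = 0.5 * cp.sum_squares(L.T @ alpha) - alpha @ y
    constraints = [cp.abs(alpha) <= 1 / (2 * n * lam), cp.multiply(y, alpha) >= 0]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return alpha.value

# Prediction then follows (8.35c): y_hat = sign(alpha_hat^T K(X, x_star))
```
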
Figure 1: The decision boundaries for support vector classification with linear kernel (left) and squared exponential kernel (right).
- As a consequence of using hinge loss, support vector classification provides hard classifications.
- Can use squared hinge loss or Huberised squared hinge loss, which allows for a probabilistic interpretation of the margin.