COMP4702 Lecture 9

Chapter 8: Non-Linear Input Transformations and Kernels

Creating Features by Non-Linear Input Transformations

What if we apply some non-linear transformations to the data before passing it to the model?

The vanilla linear regression model is written as:

$$y = \theta_0 + \theta_1 x + \varepsilon \tag{8.1}$$

From this, we can extend the model with $x^2, x^3, \dots, x^{d-1}$ as inputs, where $d$ is another hyperparameter, to obtain the following model:

$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_{d-1} x^{d-1} + \varepsilon = \boldsymbol{\theta}^T \boldsymbol{\phi}(x) + \varepsilon$$

Since $x$ is known, we can directly compute the powers $x^2, x^3, \dots, x^{d-1}$. Note that this is still a linear regression model, since the parameters $\boldsymbol{\theta}$ appear linearly, with $\boldsymbol{\phi}(x) = \begin{bmatrix}1 & x & x^2 & \cdots & x^{d-1}\end{bmatrix}^T$ as the new input vector, a vector of basis functions. The MATLAB example below fits a cubic polynomial to noisy data using this idea:

% Create our dataset
x = rand(20,1);
y = cos(x);
% Add some noise
y = y + 0.1*randn(20,1);
plot(x,y,'.');
hold on;

% For comparison, fit a straight line to the raw input
% (defines p1; a degree-1 fit is assumed here)
p1 = polyfit(x,y,1);
x1 = linspace(0,1);
y1 = polyval(p1, x1);
plot(x1,y1);

% Build the transformed inputs phi(x) = [1 x x^2 x^3]
xsqd = x.^2;
xcub = x.^3;
z = [x xsqd xcub];
b = ones(20,1); % Bias column
z = [b z];
% Training step (least squares), p3 are our theta values
p3 = inv(z'*z)*z'*y;

% Predict on a grid of test inputs
y3 = zeros(1,100);
for i=1:100
    y3(i) = p3(1) + p3(2)*x1(i) + p3(3)*x1(i)^2 + p3(4)*x1(i)^3;
end
plot(x1,y3);
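
As a sanity check (not part of the lecture code), the same cubic fit can be obtained with MATLAB's built-in polyfit, which solves the same least-squares problem:

% Sketch: polyfit returns coefficients in descending powers of x,
% so flipping them should match p3 (up to numerical error).
p_check = polyfit(x, y, 3);
disp([p3'; fliplr(p_check)]); % row 1: theta from the normal equations, row 2: polyfit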

Kernel Ridge Regression

Figure 1 - Motivation for introducing non-linear basis functions as an input.
A kernel is defined as an inner product between transformed feature vectors:

$$\kappa(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{x}')$$

Applying $L^2$-regularised linear regression (ridge regression) to the transformed inputs gives the closed-form solution:

$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \frac{1}{n} \sum_{i=1}^{n} \big(\underbrace{\boldsymbol{\theta}^T \boldsymbol{\phi}(\mathbf{x}_i)}_{\hat y(\mathbf{x}_i)} - y_i\big)^2 + \lambda \|\boldsymbol{\theta}\|_2^2 = \big(\boldsymbol{\Phi}(\mathbf{X})^T \boldsymbol{\Phi}(\mathbf{X}) + n\lambda \mathbf{I}\big)^{-1} \boldsymbol{\Phi}(\mathbf{X})^T \mathbf{y} \tag{8.4a}$$

The prediction can be rewritten so that the feature vectors only ever appear through inner products:

$$\hat y(\mathbf{x}_\star) = \underbrace{\mathbf{y}^T}_{1\times n}\, \underbrace{\big(\boldsymbol{\Phi}(\mathbf{X})\boldsymbol{\Phi}(\mathbf{X})^T + n\lambda \mathbf{I}\big)^{-1}}_{n\times n}\, \underbrace{\boldsymbol{\Phi}(\mathbf{X})\boldsymbol{\phi}(\mathbf{x}_\star)}_{n\times 1} \tag{8.7}$$

Every entry of these matrices is an inner product $\boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')$, so each can be replaced by a kernel evaluation. A common choice is the squared exponential kernel with lengthscale $\ell$:

$$\kappa(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|_2^2}{2\ell^2}\right) \tag{8.13}$$

MATLAB Example

% Create our dataset
x = 5 * rand(50,1);
y = x.^2 + 2*randn(50,1); % Our function is y = x^2 (+ noise)

plot(x,y,'.');
% Number of training points
n = 50;
% Hyperparameter, weight of regularisation in loss function
lambda = 0.01;
% Kernel function hyperparameter (lengthscale)
l = 5;
d = pdist(x); % Euclidean distance between datapoints
% Evaluate kernel function (squared exponential, eq. 8.13)
k = exp(-(d.^2)/(2*l^2));
% Turn into a square matrix; pdist omits self-distances, so add the
% diagonal entries kappa(x,x) = 1
K = squareform(k) + eye(n);
% Training: dual weights, eq. 8.7
alpha = y'*inv(K + n*lambda*eye(n,n));
% Prediction on a grid of test inputs
xtest = 0:0.1:5;
dtest = pdist2(xtest',x);
ktest = exp(-(dtest.^2)/(2*l^2));
ytest = ktest*alpha'; % transpose alpha so the dimensions match
hold on;
plot(xtest,ytest,'k');

Support Vector Regression

Figure 2 - Support Vector Regression with Epsilon-insensitive loss. The points with non-zero alpha values are known as the support vectors and are highlighted here in red.
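
The lecture only shows the resulting fit, but a minimal sketch of epsilon-insensitive SVR in MATLAB, assuming the Statistics and Machine Learning Toolbox's fitrsvm is available and reusing the dataset from the kernel ridge regression example, could look like this:

% Sketch only: fitrsvm and its options are toolbox features, not lecture code.
mdl = fitrsvm(x, y, 'KernelFunction', 'gaussian', 'Epsilon', 1.0);
ysvr = predict(mdl, xtest');
figure; plot(x, y, '.'); hold on;
plot(xtest, ysvr, 'k');
% The support vectors are the points with non-zero alpha values
sv = mdl.IsSupportVector;
plot(x(sv), y(sv), 'ro'); % highlight them in red, as in Figure 2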

Kernel Theory


Example 8.4 - Kernel k-NN for Interpreting Words using Levenshtein Distance (Edit Distance)

We can define a kernel directly on words by plugging the Levenshtein (edit) distance $\mathrm{LD}(x, x')$ into the squared exponential form:

$$\kappa(x, x') = \exp\left(-\frac{\mathrm{LD}(x, x')^2}{2\ell^2}\right)$$
| Word, $x_i$ | Meaning, $y_i$ | Levenshtein distance, $\mathrm{LD}(x_i, x_\star)$ | Squared feature-space distance, $\kappa(x_i,x_i) + \kappa(x_\star,x_\star) - 2\kappa(x_i,x_\star)$ |
| --- | --- | --- | --- |
| 'Awesome' | Positive | 8 | 1.44 |
| 'Excellent' | Positive | 10 | 1.73 |
| 'Spotless' | Positive | 9 | 1.60 |
| 'Terrific' | Positive | 8 | 1.44 |
| 'Tremendous' | Positive | 4 | 0.55 |
| 'Awful' | Negative | 9 | 1.60 |
| 'Dreadful' | Negative | 6 | 1.03 |
| 'Horrific' | Negative | 6 | 1.03 |
| 'Terrible' | Negative | 8 | 1.44 |

The test word $x_\star$ is then classified by a majority vote among the $k$ training words with the smallest feature-space distance.
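
A minimal MATLAB sketch of how one entry of this table could be computed (the word pair and the local levenshtein helper are illustrative assumptions, not lecture code; a lengthscale of l = 5 is consistent with the tabulated values):

% Illustrative sketch: kernel and feature-space distance for one word pair.
l = 5;                 % kernel lengthscale
xi = 'awesome';        % hypothetical training word
xstar = 'awful';       % hypothetical test word
LD = levenshtein(xi, xstar);
kap = exp(-LD^2/(2*l^2));   % kernel value kappa(xi, xstar)
dist2 = 2 - 2*kap;          % kappa(xi,xi) + kappa(x*,x*) - 2*kappa(xi,x*), since kappa(w,w) = 1

function d = levenshtein(a, b)
% Classic dynamic-programming edit distance between two character vectors.
m = numel(a); n = numel(b);
D = zeros(m+1, n+1);
D(:,1) = (0:m)';
D(1,:) = 0:n;
for i = 2:m+1
    for j = 2:n+1
        cost = double(a(i-1) ~= b(j-1));
        D(i,j) = min([D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1)+cost]);
    end
end
d = D(m+1, n+1);
end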

Meaning of a Kernel


Valid Kernels

Examples of valid kernels include the linear kernel and the polynomial kernel:

$$\kappa(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T\mathbf{x}' \tag{8.23}$$

$$\kappa(\mathbf{x}, \mathbf{x}') = (c + \mathbf{x}^T\mathbf{x}')^{d-1} \tag{8.24}$$
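
A kernel is valid when the Gram matrix it produces is always positive semidefinite. As an empirical check (a sketch, not part of the lecture notes), we can build the Gram matrices of (8.23) and (8.24) for some random data and inspect their smallest eigenvalues:

% Sketch: empirical positive-semidefiniteness check on random data.
X = randn(30, 2);                  % 30 random 2-D points (assumed example data)
c = 1; dpoly = 3;                  % assumed kernel hyperparameters (exponent d-1 = 3)
Klin  = X*X';                      % linear kernel (8.23)
Kpoly = (c + X*X').^dpoly;         % polynomial kernel (8.24)
[min(eig(Klin)), min(eig(Kpoly))]  % smallest eigenvalues should be >= 0 (up to rounding error)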

Support Vector Classification

The support vector classifier predicts the class label by taking the sign of a weighted sum of kernel evaluations:

$$\hat{y}(\mathbf{x}_\star) = \operatorname{sign}\big(\hat{\boldsymbol{\alpha}}^T \mathbf{K}(\mathbf{X}, \mathbf{x}_\star)\big) \tag{8.35c}$$
Figure 3 - SVM for classification with linear and squared exponential kernels. The points in yellow are the support vectors.
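
A minimal MATLAB sketch of such a classifier, assuming the Statistics and Machine Learning Toolbox's fitcsvm (not lecture code), trained with a linear and with a squared exponential (Gaussian) kernel on toy data:

% Sketch: binary SVM classification with two different kernels.
X = [randn(25,2)-1; randn(25,2)+1];    % assumed toy data: two Gaussian blobs
t = [-ones(25,1); ones(25,1)];         % class labels -1 / +1
mdl_lin = fitcsvm(X, t, 'KernelFunction', 'linear');
mdl_rbf = fitcsvm(X, t, 'KernelFunction', 'gaussian');
% Predictions follow eq. (8.35c): the sign of a weighted sum of kernel values
yhat = predict(mdl_rbf, [0 0]);
% The support vectors are the training points with non-zero alpha
sv = X(mdl_rbf.IsSupportVector, :);
plot(X(t==1,1), X(t==1,2), 'b.', X(t==-1,1), X(t==-1,2), 'r.'); hold on;
plot(sv(:,1), sv(:,2), 'yo');          % highlight the support vectors in yellow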