Creating Features by Non-Linear Input Transformations
We can make use of arbitrary non-linear transformations of the original input values in any model, including linear regression.
For a one-dimensional input, the linear regression model is given as
$$y = \theta_0 + \theta_1 x + \varepsilon \qquad (8.1)$$
We can extend this model with $x^2, x^3, \dots, x^{d-1}$ as inputs (where $d$ is a user choice) and thus obtain a linear regression model which is a polynomial in $x$:
$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_{d-1} x^{d-1} + \varepsilon = \theta^{\mathsf{T}} \phi(x) + \varepsilon \qquad (8.2)$$
Since $x$ is known, we can directly compute $x^2, \dots, x^{d-1}$.
Note that this is still a linear regression model, since the parameters $\theta$ enter linearly, with $\phi(x) = [1 \;\; x \;\; x^2 \;\; \dots \;\; x^{d-1}]^{\mathsf{T}}$ as a new input vector.
We refer to each such transformation of $x$ as a feature, and to the $d \times 1$ vector of transformed inputs $\phi(x)$ as a feature vector.
The parameters $\hat{\theta}$ are still learned in the same way, but we replace the original $n \times (p+1)$ data matrix
$$X = \begin{bmatrix} \mathbf{x}_1^{\mathsf{T}} \\ \mathbf{x}_2^{\mathsf{T}} \\ \vdots \\ \mathbf{x}_n^{\mathsf{T}} \end{bmatrix}$$
with the transformed $n \times d$ matrix
$$\Phi(X) = \begin{bmatrix} \phi(\mathbf{x}_1)^{\mathsf{T}} \\ \phi(\mathbf{x}_2)^{\mathsf{T}} \\ \vdots \\ \phi(\mathbf{x}_n)^{\mathsf{T}} \end{bmatrix}$$
The idea of non-linear input transformations is not limited to linear regression, and any choice of non-linear transformation ϕ(⋅) can be used with any supervised machine learning technique.
The non-linear transformation is applied to the input as a pre-processing step, and the transformed input is used when training, evaluating, and using the model.
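As a minimal sketch of this pre-processing view (the helper name `polynomial_features` and the toy data are illustrative, not from the text), the snippet below builds $\phi(x) = [1 \; x \; \dots \; x^{d-1}]$ for each input and then fits ordinary least-squares linear regression on the transformed inputs:

```python
import numpy as np

def polynomial_features(x, d):
    """Map a 1-D input array to the matrix Phi(X) with rows
    phi(x_i) = [1, x_i, x_i^2, ..., x_i^(d-1)]."""
    return np.vander(x, N=d, increasing=True)

# Toy data: a noisy non-linear function of a scalar input (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.shape)

d = 4                                    # number of features (user choice)
Phi = polynomial_features(x, d)          # n x d transformed input matrix
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # ordinary least squares

# Predicting at a test point applies the same transformation first.
x_star = np.array([0.5])
y_hat = polynomial_features(x_star, d) @ theta_hat
```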
Polynomials are only one out of infinitely many possible choices of features ϕ(x).
Polynomials above second order must be used with care in practice, since they can behave erratically outside the range of the observed data.
There are several alternatives that are often more useful, such as the Fourier series, essentially corresponding to:
$$\phi(x) = [1 \;\; \sin(x) \;\; \cos(x) \;\; \sin(2x) \;\; \cos(2x) \;\; \cdots]^{\mathsf{T}}$$
The use of non-linear input transformations ϕ(x) arguably makes simple models more flexible and applicable to real-world problems with non-linear characteristics.
In order to obtain good performance, it is important to choose $\phi(x)$ so that enough flexibility is obtained while overfitting is avoided.
We now explore the idea of letting the number of features $d \to \infty$ and combining this with regularisation.
In a sense, this automates the choice of features, and it leads us to a family of powerful off-the-shelf machine learning tools called kernel methods.
Kernel Ridge Regression
Ridge Regression = Linear Regression + L2 Regularisation
Kernel Ridge Regression = Ridge Regression using Kernels
Using kernels allows L2 Linear Regression to use non-linear input transformations
A carefully engineered transformation may work for a specific machine learning problem, but it is not a general solution.
We would like $\phi(\mathbf{x})$ to contain many transformations that could possibly be of interest for most problems, in order to obtain a general off-the-shelf method.
Explore the idea of letting d→∞.
Re-Formulating Linear Regression
To let $d \to \infty$, we have to use some kind of regularisation to avoid overfitting when $d > n$.
The optimisation problem for L2-regularised linear regression with transformed inputs, and its closed-form solution, is
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{n}\|\Phi(X)\theta - \mathbf{y}\|_2^2 + \lambda\|\theta\|_2^2 = \big(\Phi(X)^{\mathsf{T}}\Phi(X) + n\lambda I\big)^{-1}\Phi(X)^{\mathsf{T}}\mathbf{y},$$
where $\mathbf{y} = [y_1 \; \dots \; y_n]^{\mathsf{T}}$ is the vector of training outputs, which gives the prediction
$$\hat{y}(\mathbf{x}_\star) = \hat{\theta}^{\mathsf{T}}\phi(\mathbf{x}_\star) = \mathbf{y}^{\mathsf{T}}\Phi(X)\big(\Phi(X)^{\mathsf{T}}\Phi(X) + n\lambda I\big)^{-1}\phi(\mathbf{x}_\star).$$
This expression for $\hat{y}(\mathbf{x}_\star)$ suggests that, instead of computing and storing the $d$-dimensional $\hat{\theta}$ once (independently of $\mathbf{x}_\star$), we could compute the $n$-dimensional vector $\Phi(X)\big(\Phi(X)^{\mathsf{T}}\Phi(X) + n\lambda I\big)^{-1}\phi(\mathbf{x}_\star)$ for each test point.
By doing so, we avoid storing a d-dimensional vector.
However, this still requires inverting the $d \times d$ matrix $\Phi(X)^{\mathsf{T}}\Phi(X) + n\lambda I$, which is computationally intensive when $d$ is large.
The push-through matrix identity says that $A(A^{\mathsf{T}}A + I)^{-1} = (AA^{\mathsf{T}} + I)^{-1}A$ holds for any matrix $A$.
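The identity is easy to sanity-check numerically; a small sketch with a random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))                 # any n x d matrix

lhs = A @ np.linalg.inv(A.T @ A + np.eye(3))    # uses the d x d inverse
rhs = np.linalg.inv(A @ A.T + np.eye(5)) @ A    # uses the n x n inverse
print(np.allclose(lhs, rhs))                    # True
```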
Using it in the above expression for the prediction, we can further re-write the prediction as
$$\hat{y}(\mathbf{x}_\star) = \mathbf{y}^{\mathsf{T}}\big(\Phi(X)\Phi(X)^{\mathsf{T}} + n\lambda I\big)^{-1}\Phi(X)\phi(\mathbf{x}_\star) \qquad (8.7)$$
It appears as if we can compute $\hat{y}(\mathbf{x}_\star)$ without having to deal with any $d$-dimensional vectors or matrices, provided that the products $\Phi(X)\Phi(X)^{\mathsf{T}}$ and $\Phi(X)\phi(\mathbf{x}_\star)$ can be computed.
Remember that $\phi(\mathbf{x})^{\mathsf{T}}\phi(\mathbf{x}')$ is the inner product between the two $d$-dimensional vectors $\phi(\mathbf{x})$ and $\phi(\mathbf{x}')$.
Note that the transformed inputs $\phi(\mathbf{x})$ enter into Equation 8.7 only as inner products, and each inner product is a scalar.
That is, if we are able to compute the inner product directly without first explicitly computing the d-dimensional ϕ(x) vectors, we can avoid the d-dimensional computations and storage - we have reached our goal.
Consider the case of polynomials: if $p = 1$ (meaning that the input is a scalar $x$) and $\phi(x)$ is a third-order polynomial ($d = 4$) with the second and third elements scaled by $\sqrt{3}$, that is $\phi(x) = [1 \;\; \sqrt{3}x \;\; \sqrt{3}x^2 \;\; x^3]^{\mathsf{T}}$, we have
$$\phi(x)^{\mathsf{T}}\phi(x') = 1 + 3xx' + 3x^2x'^2 + x^3x'^3 = (1 + xx')^3.$$
It can be shown more generally that if $\phi(x)$ is a suitably re-scaled polynomial of order $d-1$, then $\phi(x)^{\mathsf{T}}\phi(x') = (1 + xx')^{d-1}$.
Instead of computing the two $d$-dimensional vectors $\phi(x)$ and $\phi(x')$ and then computing the inner product, we can just evaluate the expression $(1 + xx')^{d-1}$ directly.
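A quick numerical check of this shortcut for $d = 4$, using the $\sqrt{3}$-scaled feature map above:

```python
import numpy as np

def phi(x):
    # Re-scaled third-order polynomial features, d = 4.
    return np.array([1.0, np.sqrt(3) * x, np.sqrt(3) * x**2, x**3])

x, x_prime = 0.7, -1.3
inner = phi(x) @ phi(x_prime)        # explicit d-dimensional inner product
kernel = (1 + x * x_prime) ** 3      # evaluated directly, no phi needed
print(np.isclose(inner, kernel))     # True
```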
Consider the computational scaling in a situation where d is in the hundreds or thousands.
If we make a choice of $\phi(\mathbf{x})$ such that the inner product $\phi(\mathbf{x})^{\mathsf{T}}\phi(\mathbf{x}')$ can be computed without first computing $\phi(\mathbf{x})$, we can let $d$ be arbitrarily big.
Such a non-linear transformation can be quite difficult to construct and derive, but we can bypass this problem using the concept of a kernel.
Kernel
A kernel κ(x,x′) is any function that takes in two arguments from the same space and returns a scalar.
We limit the choice of kernel to kernels that are real-valued and symmetric: $\kappa(\mathbf{x}, \mathbf{x}') = \kappa(\mathbf{x}', \mathbf{x}) \in \mathbb{R}$.
The "kernel" that we used before, the inner product of the input transformations, is an example of a kernel:
$$\kappa(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^{\mathsf{T}}\phi(\mathbf{x}') \qquad (8.11)$$
Since ϕ(x) only appears in the linear regression model via inner products, we do not have to design a d-dimensional vector ϕ(x) and derive its inner product.
Instead, we can just use the inner product directly. This is known as the kernel trick:
If $\mathbf{x}$ enters the model only via $\phi(\mathbf{x})^{\mathsf{T}}\phi(\mathbf{x}')$, we can choose a kernel $\kappa(\mathbf{x}, \mathbf{x}')$ instead of choosing $\phi(\mathbf{x})$.
To be clear, we can write Equation 8.7 (linear regression using $\phi(\mathbf{x})$) using the kernel as
$$\hat{y}(\mathbf{x}_\star) = \mathbf{y}^{\mathsf{T}}\big(K(X, X) + n\lambda I\big)^{-1} K(X, \mathbf{x}_\star), \qquad (8.12a)$$
where the $n \times n$ Gram matrix $K(X, X)$ and the $n$-dimensional vector $K(X, \mathbf{x}_\star)$ have entries $[K(X, X)]_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and $[K(X, \mathbf{x}_\star)]_i = \kappa(\mathbf{x}_i, \mathbf{x}_\star)$.
These equations describe linear regression with L2 regularisation using a kernel κ(x,x′)
Since L2-regularised linear regression is also called ridge regression, we refer to Equation 8.12 as kernel ridge regression.
In principle, we may choose the kernel $\kappa(\mathbf{x}, \mathbf{x}')$ arbitrarily, as long as we can compute Equation 8.12a.
This requires that the inverse of $K(X, X) + n\lambda I$ exists, which is guaranteed when the kernel is positive semidefinite.
Positive semidefinite kernels include the squared exponential kernel (also called the RBF, exponentiated quadratic, or Gaussian kernel), shown in Equation 8.13:
$$\kappa(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|_2^2}{2\ell^2}\right) \qquad (8.13)$$
In this kernel function, the lengthscale $\ell > 0$ is a design choice left to the user.
Another kernel function is the polynomial kernel:
$$\kappa(\mathbf{x}, \mathbf{x}') = (c + \mathbf{x}^{\mathsf{T}}\mathbf{x}')^{d-1}$$
And this kernel has the special case of the linear kernel:
$$\kappa(\mathbf{x}, \mathbf{x}') = c + \mathbf{x}^{\mathsf{T}}\mathbf{x}'$$
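A minimal NumPy sketch of these three kernel functions (the function and argument names are illustrative, not from the text):

```python
import numpy as np

def squared_exponential_kernel(x, x_prime, ell=1.0):
    """Equation 8.13: kappa(x, x') = exp(-||x - x'||^2 / (2 * ell^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2 * ell**2))

def polynomial_kernel(x, x_prime, c=1.0, degree=3):
    """Polynomial kernel: kappa(x, x') = (c + x^T x')^(d-1), with degree = d - 1."""
    return (c + np.dot(x, x_prime)) ** degree

def linear_kernel(x, x_prime, c=1.0):
    """Special case: kappa(x, x') = c + x^T x'."""
    return c + np.dot(x, x_prime)
```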
From Equation 8.12, it may seem as if we have to compute the inverse of $K(X, X) + n\lambda I$ every time we want to make a prediction.
This is not necessary, since the matrix does not depend on the test point $\mathbf{x}_\star$.
Therefore, we can introduce the n-dimensional vector α^ such that
$$\hat{\boldsymbol{\alpha}} = \begin{bmatrix} \hat{\alpha}_1 \\ \hat{\alpha}_2 \\ \vdots \\ \hat{\alpha}_n \end{bmatrix} \quad \text{with} \quad \hat{\boldsymbol{\alpha}}^{\mathsf{T}} = \mathbf{y}^{\mathsf{T}}\big(K(X, X) + n\lambda I\big)^{-1} \qquad (8.14a)$$
This then allows us to re-write kernel ridge regression as:
$$\hat{y}(\mathbf{x}_\star) = \hat{\boldsymbol{\alpha}}^{\mathsf{T}} K(X, \mathbf{x}_\star) \qquad (8.14b)$$
This means that we do not need to store the $d$-dimensional vector $\hat{\theta}$, but we do need to store the $n$-dimensional vector $\hat{\boldsymbol{\alpha}}$ together with the training inputs $X$ (since we need them to compute $K(X, \mathbf{x}_\star)$).
The use of kernel ridge regression is summarised as:
Learn Kernel Ridge Regression
- Data: training data $\mathcal{T} = \{\mathbf{x}_i, y_i\}_{i=1}^{n}$ and kernel $\kappa$
- Result: learned dual parameters $\hat{\boldsymbol{\alpha}}$
- Compute $\hat{\boldsymbol{\alpha}}$ as per Equation 8.14a.

Predict with Kernel Ridge Regression
- Data: learned dual parameters $\hat{\boldsymbol{\alpha}}$, training inputs $X$, and test input $\mathbf{x}_\star$
- Result: prediction $\hat{y}(\mathbf{x}_\star)$
- Compute $\hat{y}(\mathbf{x}_\star)$ as per Equation 8.14b.
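A minimal NumPy sketch of these two steps, assuming a kernel function `kappa(x, x_prime)` such as the ones above; the Gram matrix is built element-wise for clarity rather than speed:

```python
import numpy as np

def learn_krr(X, y, kappa, lam):
    """Equation 8.14a: alpha_hat^T = y^T (K(X, X) + n * lam * I)^(-1)."""
    n = len(X)
    K = np.array([[kappa(X[i], X[j]) for j in range(n)] for i in range(n)])
    # Solve (K + n*lam*I) alpha = y rather than forming the inverse explicitly.
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict_krr(X, alpha_hat, kappa, x_star):
    """Equation 8.14b: y_hat(x_star) = alpha_hat^T K(X, x_star)."""
    k_star = np.array([kappa(x_i, x_star) for x_i in X])
    return alpha_hat @ k_star

# Usage with a squared exponential kernel and illustrative 1-D data.
rng = np.random.default_rng(0)
X = np.linspace(-2, 2, 40)
y = np.sin(2 * X) + 0.1 * rng.standard_normal(X.shape)
kappa = lambda a, b: np.exp(-(a - b) ** 2 / (2 * 0.5 ** 2))  # ell = 0.5
alpha_hat = learn_krr(X, y, kappa, lam=0.01)
print(predict_krr(X, alpha_hat, kappa, x_star=1.0))
```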
Support Vector Regression
A version of Kernel Ridge Regression, but with a different loss function
Representer Theorem
We can interpret Equation 8.14 as a dual formulation of linear regression, where we have dual parameters $\boldsymbol{\alpha}$ instead of primal parameters $\theta$.
Here, "primal" and "dual" are just two different ways of expressing the same linear regression model.
Comparing Equations 8.14a and 8.5, we have:
$$\hat{y}(\mathbf{x}_\star) = \hat{\theta}^{\mathsf{T}}\phi(\mathbf{x}_\star) = \hat{\boldsymbol{\alpha}}^{\mathsf{T}} K(X, \mathbf{x}_\star) = \hat{\boldsymbol{\alpha}}^{\mathsf{T}} \Phi(X)\phi(\mathbf{x}_\star) \qquad (8.15)$$
This suggests that:
$$\hat{\theta} = \Phi(X)^{\mathsf{T}} \hat{\boldsymbol{\alpha}} \qquad (8.16)$$
This relationship between $\theta$ and $\boldsymbol{\alpha}$ is not specific to kernel ridge regression, but is a consequence of a general result called the representer theorem.
This holds true whenever $\theta$ is learned using almost any loss function combined with L2 regularisation.
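For the squared-error loss with L2 regularisation, the relationship in Equation 8.16 can be checked numerically; a sketch using an explicit polynomial feature map and purely illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 4, 0.1
x = rng.uniform(-1, 1, n)
Phi = np.vander(x, N=d, increasing=True)     # explicit features phi(x_i)
y = rng.standard_normal(n)

# Primal: ridge regression solution for theta_hat.
theta_hat = np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(d), Phi.T @ y)

# Dual: alpha_hat from Equation 8.14a with the kernel kappa = phi(x)^T phi(x').
K = Phi @ Phi.T
alpha_hat = np.linalg.solve(K + n * lam * np.eye(n), y)

# Representer theorem (Equation 8.16): theta_hat = Phi(X)^T alpha_hat.
print(np.allclose(theta_hat, Phi.T @ alpha_hat))   # True
```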
Support Vector Regression
The same as Kernel Ridge Regression, but using ϵ-insensitive loss
The use of ϵ-insensitive loss causes the elements of α^ to become sparse (some elements become zero).
The training points corresponding to the non-zero elements of α^ are referred to as support vectors.
The prediction $\hat{y}(\mathbf{x}_\star)$ will only depend on the support vectors.
Reformulate the primal formulation in Equation 8.18 into a dual formulation (from using θ to using α).
We cannot use the closed-form derivation here; the dual formulation becomes:
$$\hat{y}(\mathbf{x}_\star) = \hat{\boldsymbol{\alpha}}^{\mathsf{T}} K(X, \mathbf{x}_\star) \qquad (8.19a)$$
where $\hat{\boldsymbol{\alpha}}$ is the solution to the optimisation problem:
$$\hat{\boldsymbol{\alpha}} = \arg\min_{\boldsymbol{\alpha}} \frac{1}{2}\boldsymbol{\alpha}^{\mathsf{T}} K(X, X)\boldsymbol{\alpha} - \boldsymbol{\alpha}^{\mathsf{T}}\mathbf{y} + \epsilon\|\boldsymbol{\alpha}\|_1 \qquad (8.19b)$$
$$\text{subject to } |\alpha_i| \le \frac{1}{2n\lambda}, \quad i = 1, \dots, n \qquad (8.19c)$$
Equation 8.19a is the counterpart of Equation 8.14b for kernel ridge regression; that the prediction takes this form is a consequence of the representer theorem.
A larger value of ϵ will result in fewer support vectors.
The number of support vectors is also affected by λ, since λ influences the shape of y^(x⋆).
All data is used at training time (for solving Equation 8.19b) but only the support vectors contribute toward the prediction.
Equation 8.19b is a constrained optimisation problem (constrained by Equation 8.19c).
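To make the sparsity concrete, here is a sketch of the prediction step that keeps only the support vectors, assuming $\hat{\boldsymbol{\alpha}}$ has already been obtained by solving Equations 8.19b–8.19c with some numerical optimiser:

```python
import numpy as np

def predict_svr(X, alpha_hat, kappa, x_star, tol=1e-8):
    """Equation 8.19a, using only the support vectors (non-zero alpha_hat)."""
    support = np.flatnonzero(np.abs(alpha_hat) > tol)   # indices of support vectors
    return sum(alpha_hat[i] * kappa(X[i], x_star) for i in support)
```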
Support Vector Classification
Possible to derive a kernel version of logistic regression with L2 regularisation
Consider the binary classification problem y∈{−1,1} and start with the margin formulation of the logistic regression classifier.
$$\hat{y}(\mathbf{x}_\star) = \operatorname{sign}\{\theta^{\mathsf{T}}\phi(\mathbf{x}_\star)\} \qquad (8.32)$$
If we were to learn $\theta$ using the logistic loss, we would obtain logistic regression with a non-linear feature transformation $\phi(\mathbf{x})$, from which kernel logistic regression would eventually follow.
We cannot use the kernel trick directly, since the feature vector does not appear as $\phi(\mathbf{x})^{\mathsf{T}}\phi(\mathbf{x}')$; however, we can obtain the following formulation:
$$\hat{\boldsymbol{\alpha}} = \arg\min_{\boldsymbol{\alpha}} \frac{1}{2}\boldsymbol{\alpha}^{\mathsf{T}} K(X, X)\boldsymbol{\alpha} - \boldsymbol{\alpha}^{\mathsf{T}}\mathbf{y} \qquad (8.35a)$$
$$\text{subject to } |\alpha_i| \le \frac{1}{2n\lambda} \text{ and } 0 \le \alpha_i y_i \qquad (8.35b)$$
$$\text{with } \hat{y}(\mathbf{x}_\star) = \operatorname{sign}\big(\hat{\boldsymbol{\alpha}}^{\mathsf{T}} K(X, \mathbf{x}_\star)\big) \qquad (8.35c)$$
Support Vector Classification can utilise different kernel functions which result in different decision boundaries
Figure 1: The decision boundaries for support vector classification with a linear kernel (left) and a squared exponential kernel (right).
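A comparison like the one in Figure 1 can be reproduced with, for example, scikit-learn's SVC (not referenced in the text), which supports both a linear and a squared exponential (RBF) kernel; the dataset below is a synthetic stand-in:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data with a non-linear class boundary (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.5, 1, -1)

# Linear kernel vs squared exponential (RBF) kernel; gamma = 1 / (2 * ell^2).
clf_linear = SVC(kernel="linear").fit(X, y)
clf_rbf = SVC(kernel="rbf", gamma=1.0).fit(X, y)

for name, clf in [("linear", clf_linear), ("rbf", clf_rbf)]:
    print(name, "accuracy:", clf.score(X, y), "support vectors:", len(clf.support_))
```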
- As a consequence of using hinge loss, support vector classification provides hard classifications.
- One can instead use the squared hinge loss or the Huberised squared hinge loss, which allow for a probabilistic interpretation of the margin.