Chapter 9 Summary

The Bayesian Idea


$$p({\bf y})=\int p({\bf y}|\theta)p(\theta)\,d\theta \tag{9.2}$$
$$p(y_\star|{\bf x_\star}, {\bf y}) = \int p(y_\star|\theta)p(\theta|{\bf y})\,d\theta \tag{9.3}$$
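Both integrals can be evaluated numerically for a toy one-dimensional parameter. The sketch below is my own (the Gaussian likelihood with unknown mean $\theta$, known noise variance, and $\mathcal{N}(0,1)$ prior are assumptions, not from the chapter): it computes the posterior on a grid, then the predictive of eq. (9.3) by summing the likelihood against that posterior.

```python
import numpy as np

# Toy 1-D setting (assumed): unknown mean theta, known noise variance,
# prior theta ~ N(0, 1).
rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=1.0, size=20)   # observations
sigma2 = 1.0                                  # known noise variance

theta = np.linspace(-5, 5, 2001)              # grid over the parameter
dtheta = theta[1] - theta[0]

prior = np.exp(-0.5 * theta**2)               # N(0, 1) prior, unnormalised
log_lik = -0.5 * ((y[:, None] - theta) ** 2).sum(axis=0) / sigma2
posterior = prior * np.exp(log_lik)
posterior /= posterior.sum() * dtheta         # normalise by the integral in eq. (9.2)

# Predictive density for a new observation, eq. (9.3), by grid summation:
y_star = np.linspace(-5, 5, 2001)
pred = np.array([
    (np.exp(-0.5 * (ys - theta) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
     * posterior).sum() * dtheta
    for ys in y_star
])
```

On a grid this is exact up to discretisation; in the conjugate Gaussian case the same quantities exist in closed form, which is what the rest of the chapter exploits.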

Representation of Beliefs


Bayesian Linear Regression

A multivariate Gaussian over $q$ variables is parameterised by a mean vector and a covariance matrix:

$${\bf\mu}=\begin{bmatrix}\mu_1\\\mu_2\\\vdots\\\mu_q\end{bmatrix}, \qquad {\bf\Sigma} = \begin{bmatrix} \sigma_{1}^2&\sigma_{12}&\dots&\sigma_{1q}\\ \sigma_{21} & \sigma_{2}^2 & \dots & \sigma_{2q}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_{q1} & \sigma_{q2} & \dots & \sigma_{q}^2 \end{bmatrix}$$

From Chapter 3, the linear regression model is given by:

$$y=f({\bf x}) + \varepsilon, \qquad f({\bf x})=\theta^T{\bf x}, \qquad \varepsilon \sim \mathcal{N}(0,\sigma^2) \tag{9.6}$$
$$p(y|\theta)=\mathcal{N}(y;\theta^T{\bf x},\sigma^2) \tag{9.7}$$

$$p({\bf y}|\theta)=\prod_{i=1}^n p(y_i|\theta)=\prod_{i=1}^n\mathcal{N}(y_i;\theta^T{\bf x}_i,\sigma^2) =\mathcal{N}({\bf y};{\bf X}\theta, \sigma^2{\bf I}) \tag{9.8}$$
$$p(\theta)=\mathcal{N}(\theta;{\bf\mu}_0,{\bf\Sigma}_0) \tag{9.9}$$
$$\begin{align*} p(\theta|{\bf y})&=\mathcal{N}(\theta;{\bf \mu}_n, {\bf \Sigma}_n) \tag{9.10a} \\ {\bf\mu}_n&={\bf\Sigma}_n \left(\frac{1}{\sigma_0^2} {\bf\mu}_0 + \frac{1}{\sigma^2} {\bf X}^T {\bf y}\right) \tag{9.10b} \\ {\bf\Sigma}_n &= \left(\frac{1}{\sigma_0^2} {\bf I} + \frac{1}{\sigma^2} {\bf X}^T {\bf X}\right)^{-1} \tag{9.10c} \end{align*}$$

where the prior covariance is taken to be isotropic, ${\bf\Sigma}_0=\sigma_0^2{\bf I}$.
$$\begin{align*} p(f({\bf x_\star}) | {\bf y}) &= \mathcal{N}(f({\bf x_\star}); m_\star, s_\star) \tag{9.11a}\\ m_\star&={\bf x}_\star^T {\bf\mu}_n \tag{9.11b}\\ s_\star&={\bf x}_\star^T {\bf\Sigma}_n {\bf x}_\star + \sigma^2 \tag{9.11c} \end{align*}$$
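Equations (9.10) and (9.11) translate directly into a few lines of numpy. The sketch below uses my own function names and assumes a zero prior mean ${\bf\mu}_0 = {\bf 0}$ with the isotropic prior covariance $\sigma_0^2{\bf I}$ that appears in eq. (9.10):

```python
import numpy as np

def blr_posterior(X, y, sigma2, sigma02):
    """Posterior N(mu_n, Sigma_n) over theta, eqs. (9.10b)-(9.10c),
    assuming prior mean 0 and isotropic prior covariance sigma02 * I."""
    d = X.shape[1]
    Sigma_n = np.linalg.inv(np.eye(d) / sigma02 + X.T @ X / sigma2)  # eq. (9.10c)
    mu_n = Sigma_n @ (X.T @ y / sigma2)                              # eq. (9.10b), mu_0 = 0
    return mu_n, Sigma_n

def blr_predict(x_star, mu_n, Sigma_n, sigma2):
    """Predictive mean and variance at x_star, eqs. (9.11b)-(9.11c)."""
    m_star = x_star @ mu_n
    s_star = x_star @ Sigma_n @ x_star + sigma2
    return m_star, s_star
```

With enough data the posterior mean concentrates on the value of $\theta$ that generated the data, while $s_\star$ never shrinks below the noise floor $\sigma^2$.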

Connection to Regularised Linear Regression

The Gaussian Process


$$p\left(\begin{bmatrix}{\bf f}_1 \\ {\bf f}_2 \end{bmatrix}\right) =\mathcal{N} \left( \begin{bmatrix}{\bf f}_1 \\ {\bf f}_2 \end{bmatrix}; \begin{bmatrix}{\bf \mu}_1 \\ {\bf \mu}_2 \end{bmatrix}, \begin{bmatrix}{\bf \Sigma}_{11} & {\bf \Sigma}_{12} \\ {\bf \Sigma}_{21} & {\bf \Sigma}_{22} \end{bmatrix} \right) \tag{9.16}$$
$$p({\bf f}_2|{\bf f}_1) = \mathcal{N}({\bf f}_2;\ {\bf \mu}_2 + {\bf \Sigma}_{21} {\bf\Sigma}_{11}^{-1}({\bf f}_1-{\bf \mu}_1),\ {\bf\Sigma}_{22} - {\bf\Sigma}_{21}{\bf\Sigma}_{11}^{-1}{\bf\Sigma}_{12}) \tag{9.17}$$
Figure 1 - Gaussian distribution for random variables f1 and f2 before and after sampling a value.
Figure 3 - Gaussian distribution for random variables f1 and f2 before and after sampling a value.
Figure 4 - A six-dimensional Gaussian distribution conditioned on an observation of $f_4$. Note that only the marginals are plotted in both subplots.
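Eq. (9.17) is the mechanism behind the figures above: conditioning a joint Gaussian on observed components. A minimal numpy sketch (the function name and the worked example are my own):

```python
import numpy as np

def condition_gaussian(mu1, mu2, S11, S12, S21, S22, f1):
    """Condition a joint Gaussian on f1, returning p(f2 | f1) as in eq. (9.17)."""
    A = S21 @ np.linalg.inv(S11)       # Sigma_21 Sigma_11^{-1}
    mu_cond = mu2 + A @ (f1 - mu1)     # conditional mean
    S_cond = S22 - A @ S12             # conditional covariance
    return mu_cond, S_cond
```

Observing ${\bf f}_1$ shifts the mean of ${\bf f}_2$ in proportion to the cross-covariance ${\bf\Sigma}_{21}$ and always shrinks its covariance, exactly what Figures 1-4 show.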
$$p\left( \begin{bmatrix} f({\bf x}_1) \\ \vdots \\ f({\bf x}_n) \end{bmatrix}\right) = \mathcal{N}\left( \begin{bmatrix} f({\bf x}_1) \\ \vdots \\ f({\bf x}_n) \end{bmatrix}; \begin{bmatrix} \mu({\bf x}_1) \\ \vdots \\ \mu({\bf x}_n) \end{bmatrix}, \begin{bmatrix} \kappa({\bf x}_1, {\bf x}_1) & \cdots & \kappa({\bf x}_1, {\bf x}_n) \\ \vdots & \ddots & \vdots \\ \kappa({\bf x}_n, {\bf x}_1) & \cdots & \kappa({\bf x}_n, {\bf x}_n) \end{bmatrix} \right) \tag{9.18}$$
$$f\sim\mathcal{GP}(\mu({\bf x}), \kappa({\bf x,x'})) \tag{9.19}$$
$$p\left( \begin{pmatrix} f({\bf x_\star}) \\ f({\bf X}) \end{pmatrix} \right) = \mathcal{N}\left( \begin{pmatrix} f({\bf x_\star}) \\ f({\bf X}) \end{pmatrix}; \begin{pmatrix} \mu({\bf x_\star}) \\ \mu({\bf X}) \end{pmatrix}, \begin{pmatrix} \kappa({\bf x_\star, x_\star}) & \kappa({\bf x_\star, X})^T \\ \kappa({\bf X, x_\star}) & \kappa({\bf X, X}) \end{pmatrix} \right) \tag{9.20}$$
$$p(f({\bf x_\star}) | f({\bf X})) = \mathcal{N}\big(f({\bf x_\star});\ \mu({\bf x_\star}) + {\bf K}({\bf X}, {\bf x}_\star)^T{\bf K}({\bf X,X})^{-1} (f({\bf X}) -\mu({\bf X})),\ \kappa({\bf x_\star, x_\star}) - {\bf K}({\bf X, x_\star})^T{\bf K}({\bf X,X})^{-1} {\bf K}({\bf X, x_\star})\big) \tag{9.21}$$

$$\kappa({\bf x, x'}) = \mathbb{I}\lbrace {\bf x} = {\bf x'}\rbrace = \begin{cases}1&\text{if }{\bf x}={\bf x'}\\0&\text{otherwise}\end{cases} \tag{9.22}$$
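Eq. (9.22) is easy to implement as a Gram-matrix builder. For contrast, the sketch below also includes a squared-exponential kernel, a common smooth choice that is not defined in this summary; both function names are my own:

```python
import numpy as np

def white_noise_kernel(X1, X2):
    """White-noise kernel of eq. (9.22): 1 where x = x', else 0."""
    return (np.abs(X1[:, None] - X2[None, :]) < 1e-12).astype(float)

def squared_exponential(X1, X2, ell=1.0):
    """Squared-exponential kernel (an assumed choice):
    kappa(x, x') = exp(-(x - x')^2 / (2 ell^2))."""
    return np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / ell ** 2)

# Gram matrices on three distinct 1-D inputs
X = np.array([0.0, 1.0, 2.5])
K_white = white_noise_kernel(X, X)   # identity, since all inputs differ
K_se = squared_exponential(X, X)     # correlations decay smoothly with distance
```

The white-noise kernel makes every $f({\bf x})$ independent, so observations at one input say nothing about any other; smooth kernels such as the squared-exponential are what let a Gaussian process generalise across inputs.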

Extending Kernel Ridge Regression into a Gaussian Process


$$\begin{align*} p(f({\bf x_\star}) | {\bf y}) &= \mathcal{N}(f({\bf x_\star}); m_\star, s_\star) \tag{9.23a} \\ m_\star&=\phi({\bf x_\star})^T \left(\sigma^2 {\bf I} + {\bf\Phi}({\bf X})^T {\bf\Phi}({\bf X})\right)^{-1} {\bf\Phi}({\bf X})^T {\bf y} \tag{9.23b}\\ s_\star &= \phi({\bf x_\star})^T \left({\bf I} + \frac{1}{\sigma^2} {\bf\Phi}({\bf X})^T {\bf\Phi}({\bf X})\right)^{-1} \phi({\bf x_\star}) \tag{9.23c} \end{align*}$$
$$m_\star = \phi({\bf x_\star})^T {\bf\Phi}({\bf X})^T \left(\sigma^2 {\bf I} + {\bf \Phi}({\bf X}){\bf\Phi}({\bf X})^T\right)^{-1} {\bf y} \tag{9.24a}$$
$$s_\star = \phi({\bf x_\star})^T\phi({\bf x_\star}) - \phi({\bf x_\star})^T {\bf\Phi}({\bf X})^T \left(\sigma^2 {\bf I} + {\bf \Phi}({\bf X}){\bf\Phi}({\bf X})^T\right)^{-1} {\bf\Phi}({\bf X})\phi({\bf x_\star}) \tag{9.24b}$$
$$\begin{align*} m_\star &= {\bf K}({\bf X, x_\star})^T \left(\sigma^2 {\bf I} + {\bf K}({\bf X,X})\right)^{-1} {\bf y} \tag{9.25a}\\ s_\star &= \kappa({\bf x_\star, x_\star}) - {\bf K}({\bf X, x_\star})^T \left(\sigma^2 {\bf I} + {\bf K}({\bf X,X})\right)^{-1} {\bf K}({\bf X, x_\star}) \tag{9.25b} \end{align*}$$
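Eq. (9.25) is all that is needed to run Gaussian-process regression in code. A minimal numpy sketch under my own naming; the squared-exponential kernel is an assumed choice, since eq. (9.25) works for any positive semi-definite $\kappa$:

```python
import numpy as np

def se_kernel(X1, X2, ell=1.0):
    # Squared-exponential kernel; an assumed choice, any valid kernel works here
    return np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / ell ** 2)

def gp_predict(X, y, X_star, sigma2, kernel=se_kernel):
    """Posterior mean and marginal variance of f(x_star), eqs. (9.25a)-(9.25b)."""
    K = kernel(X, X)                            # K(X, X)
    Ks = kernel(X, X_star)                      # columns are K(X, x_star)
    Kss = kernel(X_star, X_star)
    M = sigma2 * np.eye(len(X)) + K             # sigma^2 I + K(X, X)
    m = Ks.T @ np.linalg.solve(M, y)            # eq. (9.25a)
    V = Kss - Ks.T @ np.linalg.solve(M, Ks)     # eq. (9.25b), full covariance
    return m, np.diag(V)
```

For numerical robustness one would normally factor $\sigma^2{\bf I} + {\bf K}$ once with a Cholesky decomposition instead of calling `solve` twice; the direct translation above keeps the correspondence with eq. (9.25) visible.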

A Non-Parametric Distribution over Functions

Drawing Samples from a Gaussian Process
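A standard way to draw sample paths (sketched here under my own naming) is to evaluate the kernel on a grid of inputs, factor ${\bf K} = {\bf L}{\bf L}^T$ with a Cholesky decomposition, and push standard-normal draws through ${\bf L}$: if ${\bf z} \sim \mathcal{N}({\bf 0}, {\bf I})$ then ${\bf L}{\bf z} \sim \mathcal{N}({\bf 0}, {\bf K})$. The squared-exponential kernel is an assumed choice for illustration:

```python
import numpy as np

def se_kernel(X1, X2, ell=1.0):
    # Squared-exponential kernel; an assumed choice for illustration
    return np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / ell ** 2)

def sample_gp_prior(X, kernel=se_kernel, n_samples=3, jitter=1e-8, seed=0):
    """Draw zero-mean sample paths f ~ GP(0, kappa) evaluated at the inputs X."""
    K = kernel(X, X) + jitter * np.eye(len(X))   # jitter keeps the Cholesky stable
    L = np.linalg.cholesky(K)                    # K = L L^T
    z = np.random.default_rng(seed).standard_normal((len(X), n_samples))
    return L @ z                                 # each column is one sample path
```

Samples from the posterior work the same way, just with the conditional mean and covariance of eq. (9.21) in place of ${\bf 0}$ and ${\bf K}$.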

There is further content in the chapter on the practical aspects of the Gaussian process, which I have omitted here.