COMP4702 Lecture 10

Chapter 9: The Bayesian Approach and Gaussian Processes (Bayesian Inference)

The Bayesian Idea

"The statistical methodology of Bayesian learning is distinguished by its use of probability to express all forms of uncertainty. Learning and other forms of inference can then be performed by what in theory are simple applications of the rules of probability. The results of Bayesian Learning are expressed in terms of a probability distribution over unknown quantities. In general, these probabilities can be interpreted only as expressions on our degree of belief in the various probabilities." - Radford Neal

$$L(\theta \mid \mathcal{X}) \propto p(\mathcal{X} \mid \theta) = \prod_{i=1}^{n} p(\mathbf{x}_i \mid \theta) \tag{1}$$

$$p(\theta \mid \mathcal{X}) = \frac{p(\mathcal{X} \mid \theta)\, p(\theta)}{p(\mathcal{X})} \propto L(\theta \mid \mathcal{X})\, p(\theta) \tag{2}$$

That is, $\text{posterior} \propto \text{likelihood} \times \text{prior}$.
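As an illustrative sketch of the posterior $\propto$ likelihood $\times$ prior rule (not from the lecture; the toy problem, data and prior are made up), the following evaluates an unnormalised posterior on a parameter grid and normalises it numerically:

```python
import numpy as np

# Toy illustration of posterior ∝ likelihood × prior on a grid.
# Unknown: the mean theta of a Gaussian with known noise std sigma.
sigma = 1.0                                    # assumed known noise std
data = np.array([0.8, 1.1, 1.4])               # made-up observations
theta = np.linspace(-3, 3, 601)                # grid over the unknown parameter
dtheta = theta[1] - theta[0]

# Prior: theta ~ N(0, 2^2), evaluated up to a constant on the grid
log_prior = -0.5 * (theta / 2.0) ** 2

# Log-likelihood: sum over data points of log N(x_i; theta, sigma^2)
log_lik = sum(-0.5 * ((x - theta) / sigma) ** 2 for x in data)

# Unnormalised log posterior, exponentiated stably and normalised numerically
log_post = log_lik + log_prior
post = np.exp(log_post - log_post.max())
post /= post.sum() * dtheta

print("posterior mean ~", (theta * post).sum() * dtheta)
```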

Bayesian Linear Regression

Here we take the linear regression model from earlier and combine it with Bayesian inference.

Multivariate Gaussian Distribution


A $q$-dimensional Gaussian random vector $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is parameterised by a mean vector and a covariance matrix:

$$\boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_q \end{bmatrix}, \qquad \boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1q} \\ \sigma_{21} & \sigma_2^2 & & \sigma_{2q} \\ \vdots & & \ddots & \vdots \\ \sigma_{q1} & \sigma_{q2} & \cdots & \sigma_q^2 \end{bmatrix}$$
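Stated here for reference (a standard identity rather than something written in these notes): the conditioning rule for a partitioned multivariate Gaussian, which is what the posterior and predictive expressions below rely on.

$$\begin{bmatrix} \mathbf{x}_a \\ \mathbf{x}_b \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{bmatrix}, \begin{bmatrix} \boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab} \\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb} \end{bmatrix} \right) \;\Longrightarrow\; p(\mathbf{x}_a \mid \mathbf{x}_b) = \mathcal{N}\!\left( \mathbf{x}_a;\; \boldsymbol{\mu}_a + \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}(\mathbf{x}_b - \boldsymbol{\mu}_b),\; \boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}\boldsymbol{\Sigma}_{ba} \right)$$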


For a Gaussian prior $\boldsymbol{\theta} \sim \mathcal{N}(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$ and Gaussian noise with variance $\sigma^2$, the predictive distribution for a new output $y_\star$ and the marginal distribution of the observed outputs $\mathbf{y}$ are

$$p(y_\star \mid \mathbf{y}) = \mathcal{N}\!\left(y_\star;\, m_\star,\, s_\star + \sigma^2\right) \tag{9.11d}$$

$$p(\mathbf{y}) = \mathcal{N}\!\left(\mathbf{y};\, \mathbf{X}\boldsymbol{\mu}_0,\, \sigma^2\mathbf{I} + \mathbf{X}\boldsymbol{\Sigma}_0\mathbf{X}^T\right) \tag{9.12}$$
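The closed-form posterior over $\boldsymbol{\theta}$ behind these expressions can be computed directly. Below is a minimal numpy sketch; the variable names and data are made up, and the standard conjugate linear-Gaussian update is assumed rather than taken from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 1-D data for the model y = theta_1 + theta_2 * x + noise
x = rng.uniform(-1, 1, size=10)
X = np.column_stack([np.ones_like(x), x])     # design matrix with intercept column
true_theta = np.array([0.5, -1.0])
sigma2 = 0.1 ** 2                             # noise variance (assumed known)
y = X @ true_theta + rng.normal(0, np.sqrt(sigma2), size=x.shape)

# Prior theta ~ N(mu0, Sigma0)
mu0 = np.zeros(2)
Sigma0 = np.eye(2)

# Posterior over theta (conjugate linear-Gaussian update)
Sigma_post = np.linalg.inv(np.linalg.inv(Sigma0) + X.T @ X / sigma2)
mu_post = Sigma_post @ (np.linalg.inv(Sigma0) @ mu0 + X.T @ y / sigma2)

# Predictive distribution for a new input, cf. (9.11d):
# p(y_star | y) = N(y_star; m_star, s_star + sigma^2)
x_star = np.array([1.0, 0.3])                 # feature vector [1, x_star]
m_star = x_star @ mu_post
s_star = x_star @ Sigma_post @ x_star
print(f"predictive mean {m_star:.3f}, predictive var {s_star + sigma2:.3f}")
```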

Example 9.1 - Bayesian Linear Regression

We consider a simple one-dimensional example, with

$$y = \theta_1 + \theta_2 x + \varepsilon$$

In this first figure, we see the models generated by sampling from the prior.

Figure 1a - Bayesian Linear Regression Prior. (Left): Feature Space, where the features are the intercept and gradient of the line respectively. (Right): Plotted models, with parameters sampled from feature space.

Initially, when no data has been observed, the prior distribution is broad. We next see the effect of sampling from the posterior after observing a single data point.

Figure 1b - Bayesian Linear Regression Posterior with 1 data point. (Left): Feature Space, where the features are the intercept and gradient of the line respectively. (Right): Plotted models, with parameters sampled from feature space.

Extending this, we see the effect of sampling from the posterior with three data points and with 30 data points respectively. The distribution continues to shrink as more data are observed.

Figure 1c - Bayesian Linear Regression Posterior with 3 data points. (Left): Feature Space, where the features are the intercept and gradient of the line respectively. (Right): Plotted models, with parameters sampled from feature space.
Figure 1d - Bayesian Linear Regression Posterior with 30 data points. (Left): Feature Space, where the features are the intercept and gradient of the line respectively. (Right): Plotted models, with parameters sampled from feature space.

After observing 30 data points, the distribution is much more concentrated. We can see how the posterior contracts as more data are observed: the blue surface on the left becomes more peaked, and the blue lines in the plot on the right become more concentrated around the true line.

Example 9.2 - Car Stopping Distances

Two models are fitted to the car stopping distance data, one linear and one quadratic in the speed $x$:

$$y = 1 + \theta_1 x + \varepsilon$$

$$y = 1 + \theta_1 x + \theta_2 x^2 + \varepsilon$$
Figure 2 - Car stopping distance with Bayesian Linear Regression. (Left): Linear transformation. (Right): Quadratic transformation
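Note that the quadratic model is still linear in its parameters, so the same Bayesian linear regression machinery applies once the input is expanded into polynomial features. A small sketch of the two feature matrices (the speed values are placeholders, not the actual car stopping dataset):

```python
import numpy as np

speed = np.array([4.0, 10.0, 17.0, 25.0])     # placeholder speeds, not the real dataset

# Linear model    y = 1 + theta_1 * x + eps                 -> feature x
# Quadratic model y = 1 + theta_1 * x + theta_2 * x^2 + eps -> features [x, x^2]
# (The fixed offset 1 can be moved to the left-hand side as y - 1 before the update.)
X_lin = speed[:, None]
X_quad = np.column_stack([speed, speed ** 2])

print(X_lin.shape, X_quad.shape)              # (4, 1) and (4, 2)
```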

Connection to Regularised Linear Regression
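As a sketch of the connection (assuming a zero-mean isotropic Gaussian prior $\boldsymbol{\theta} \sim \mathcal{N}(\mathbf{0}, \sigma_0^2\mathbf{I})$ and Gaussian noise with variance $\sigma^2$), the MAP estimate of $\boldsymbol{\theta}$ coincides with $L^2$-regularised least squares:

$$\hat{\boldsymbol{\theta}}_{\text{MAP}} = \arg\min_{\boldsymbol{\theta}} \Big[ -\ln p(\mathbf{y} \mid \boldsymbol{\theta}) - \ln p(\boldsymbol{\theta}) \Big] = \arg\min_{\boldsymbol{\theta}} \; \|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|_2^2 + \frac{\sigma^2}{\sigma_0^2}\|\boldsymbol{\theta}\|_2^2$$

i.e. ridge regression with regularisation parameter $\lambda = \sigma^2/\sigma_0^2$; the full Bayesian posterior additionally keeps the uncertainty around this point estimate.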

The Gaussian Process

Figure 3 - Connection of concepts learned so far.


Figure 4 - Gaussian Process example with a squared exponential covariance function. This was generated using http://smlbook.org/GP/index.html
Figure 5 - Gaussian Process example with a squared exponential covariance function. This was generated using http://smlbook.org/GP/index.html
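For reference, a Gaussian process is specified by a mean function and a covariance (kernel) function; the squared exponential covariance used in Figures 4 and 5 is commonly written (with length scale $\ell$ and signal variance $\sigma_f^2$ as hyperparameters, values not fixed by these notes) as

$$\kappa(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\!\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|_2^2}{2\ell^2} \right)$$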

Training a Gaussian Process Model

$$m_\star = \mathbf{K}(\mathbf{X}, \mathbf{x}_\star)^T \left(\sigma^2 \mathbf{I} + \mathbf{K}(\mathbf{X}, \mathbf{X})\right)^{-1} \mathbf{y} \tag{9.25a}$$

$$s_\star = \kappa(\mathbf{x}_\star, \mathbf{x}_\star) - \mathbf{K}(\mathbf{X}, \mathbf{x}_\star)^T \left(\sigma^2 \mathbf{I} + \mathbf{K}(\mathbf{X}, \mathbf{X})\right)^{-1} \mathbf{K}(\mathbf{X}, \mathbf{x}_\star) \tag{9.25b}$$
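A minimal numpy sketch of (9.25a) and (9.25b), assuming a squared exponential kernel with arbitrarily chosen hyperparameters; a Cholesky factorisation replaces the explicit matrix inverse for numerical stability:

```python
import numpy as np

def sq_exp_kernel(A, B, ell=0.5, sigma_f=1.0):
    """Squared exponential covariance between two sets of 1-D inputs."""
    d = A[:, None] - B[None, :]
    return sigma_f ** 2 * np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=8)                # training inputs (made up)
y = np.sin(X) + rng.normal(0, 0.1, size=8)    # noisy training targets
sigma2 = 0.1 ** 2                             # noise variance

x_star = np.linspace(-3, 3, 200)              # test inputs

K = sq_exp_kernel(X, X)                       # K(X, X)
K_s = sq_exp_kernel(X, x_star)                # K(X, x_star)
k_ss = sq_exp_kernel(x_star, x_star)          # kappa(x_star, x_star) on the diagonal

# (9.25a) and (9.25b) via a Cholesky factorisation of (sigma^2 I + K(X, X))
L = np.linalg.cholesky(sigma2 * np.eye(len(X)) + K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
m_star = K_s.T @ alpha                        # posterior mean at the test inputs
V = np.linalg.solve(L, K_s)
s_star = np.diag(k_ss) - np.sum(V * V, axis=0)  # posterior variance at the test inputs

print(m_star[:3], s_star[:3])
```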

Drawing Samples from a Gaussian Process
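A minimal sketch of drawing samples from the GP prior: evaluate the covariance matrix on a grid of inputs, add a small jitter so the Cholesky factorisation succeeds numerically, and multiply the factor by standard normal draws (the kernel and hyperparameters are again my assumptions, not taken from the notes):

```python
import numpy as np

def sq_exp_kernel(A, B, ell=0.5, sigma_f=1.0):
    d = A[:, None] - B[None, :]
    return sigma_f ** 2 * np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(2)
x_grid = np.linspace(-3, 3, 200)

# Prior over function values on the grid: f ~ N(0, K), with K from the kernel.
K = sq_exp_kernel(x_grid, x_grid)
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x_grid)))   # jitter keeps K positive definite

# Each column of Z gives one sample path f = L @ z, since Cov(L z) = L L^T = K.
Z = rng.standard_normal((len(x_grid), 5))
samples = L @ Z                                          # five prior sample paths

print(samples.shape)                                     # (200, 5)
```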

Practical Aspects of the Gaussian Process