Lindholm Chapter 10 - Generative Models and Unlabelled Data

Gaussian Mixture Model and Discriminant Analysis.

$$p({\bf x}, y) = p({\bf x} | y)\, p(y) \tag{10.1a}$$

$$p(y=1) = \pi_1, \quad \dots, \quad p(y=M) = \pi_M \tag{10.1b}$$

$$p({\bf x} | y) = \mathcal{N}\left({\bf x} | {\boldsymbol\mu}_y, {\bf\Sigma}_y\right) \tag{10.2}$$
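As a minimal sketch of this generative model, sampling a pair $({\bf x}, y)$ amounts to drawing $y$ from the categorical prior and then ${\bf x}$ from the corresponding class-conditional Gaussian. The priors, means, and covariances below are made up for a two-class, two-dimensional toy problem:

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.3, 0.7])                       # class priors pi_1, ..., pi_M
mu = np.array([[0.0, 0.0], [3.0, 3.0]])         # class means mu_m
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])  # class covariances Sigma_m

def sample(n):
    """Draw n pairs (x, y) from the generative model (10.1)-(10.2)."""
    y = rng.choice(len(pi), size=n, p=pi)        # y ~ Categorical(pi)
    x = np.array([rng.multivariate_normal(mu[m], Sigma[m]) for m in y])  # x | y ~ N(mu_y, Sigma_y)
    return x, y

X, y = sample(200)   # toy dataset reused in the later sketches
```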

Supervised Learning of the Gaussian Mixture Model

$$\hat\theta = \arg\max_\theta \ln p(\underbrace{\lbrace {\bf x}_i, y_i \rbrace_{i=1}^n}_{\mathcal{T}} | \theta) \tag{10.2a}$$

$$\begin{align*} \ln p(\lbrace {\bf x}_i, y_i \rbrace_{i=1}^n | \theta) &= \sum_{i=1}^n \lbrace \ln p({\bf x}_i | y_i, \theta) + \ln p(y_i | \theta) \rbrace \\ &= \sum_{i=1}^{n} \sum_{m=1}^M \mathbb{I}\lbrace y_i = m \rbrace \left\lbrace \ln \mathcal{N}({\bf x}_i | {\bf\mu}_m, {\bf\Sigma}_m) + \ln \pi_m \right\rbrace \tag{10.2b} \end{align*}$$
$$\hat\pi_m = \frac{n_m}{n} \tag{10.3a}$$

$$\hat{\bf\mu}_m = \frac{1}{n_m} \sum_{i: y_i = m} {\bf x}_i \tag{10.3b}$$

$$\hat{\bf\Sigma}_m = \frac{1}{n_m} \sum_{i: y_i = m} ({\bf x}_i - \hat{\bf\mu}_m)({\bf x}_i - \hat{\bf\mu}_m)^T \tag{10.3c}$$

where $n_m$ is the number of training data points with label $y_i = m$.
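A minimal numpy sketch of the closed-form estimates (10.3a)-(10.3c), assuming the labels are encoded as integers $0, \dots, M-1$ (the function name and interface are illustrative, not from the book):

```python
import numpy as np

def fit_gmm_supervised(X, y, M):
    """Maximum-likelihood estimates (10.3a)-(10.3c) from fully labelled data.

    X: (n, d) array of inputs, y: (n,) array of integer labels in {0, ..., M-1}.
    """
    n, d = X.shape
    pi_hat = np.zeros(M)
    mu_hat = np.zeros((M, d))
    Sigma_hat = np.zeros((M, d, d))
    for m in range(M):
        Xm = X[y == m]                       # training points in class m
        n_m = len(Xm)
        pi_hat[m] = n_m / n                  # (10.3a)
        mu_hat[m] = Xm.mean(axis=0)          # (10.3b)
        diff = Xm - mu_hat[m]
        Sigma_hat[m] = diff.T @ diff / n_m   # (10.3c)
    return pi_hat, mu_hat, Sigma_hat
```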

Predicting Output Labels for New Inputs: Discriminant Analysis

$$p(y|{\bf x}) = \frac{p({\bf x}, y)}{p({\bf x})} = \frac{p({\bf x}, y)}{\sum_{j=1}^M p({\bf x}, y=j)} \tag{10.4}$$

$$p(y=m | {\bf x}_\star) = \frac{\hat\pi_m \mathcal{N}\left({\bf x}_\star | \hat{\bf\mu}_m, \hat{\bf\Sigma}_m\right)}{\sum_{j=1}^M \hat\pi_j \mathcal{N}\left({\bf x}_\star | \hat{\bf\mu}_j, \hat{\bf\Sigma}_j\right)} \tag{10.5}$$

$$\hat y_\star = \arg\max_m p(y=m | {\bf x}_\star) \tag{10.6}$$

$$\hat y_\star = \arg\max_m \left\lbrace \ln\hat\pi_m + \ln \mathcal{N}\left({\bf x}_\star | \hat{\bf\mu}_m, \hat{\bf\Sigma}_m\right) \right\rbrace \tag{10.7}$$
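The prediction step (10.5)-(10.7) can be sketched as follows, working with log-densities for numerical stability. The interface is again illustrative and assumes scipy is available:

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict_gmm(x_star, pi_hat, mu_hat, Sigma_hat):
    """Discriminant-analysis prediction (10.5)-(10.7) for a single test input x_star."""
    M = len(pi_hat)
    # delta_m = ln pi_m + ln N(x_star | mu_m, Sigma_m), as in (10.7)
    delta = np.array([
        np.log(pi_hat[m]) + multivariate_normal.logpdf(x_star, mu_hat[m], Sigma_hat[m])
        for m in range(M)
    ])
    # Class posterior (10.5), computed as a numerically stable softmax over delta
    posterior = np.exp(delta - delta.max())
    posterior /= posterior.sum()
    y_hat = int(np.argmax(delta))            # (10.6)
    return y_hat, posterior
```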

Data: Training data $\mathcal{T} = \lbrace {\bf x}_i, y_i \rbrace_{i=1}^n$.
Result: Gaussian Mixture Model.

  1. for $m=1, \dots, M$ do
  2.  |   Compute $\hat\pi_m$ using Equation 10.3a, $\hat{\bf\mu}_m$ using Equation 10.3b and $\hat{\bf\Sigma}_m$ using Equation 10.3c
  3. end

Data: Gaussian Mixture Model and test input ${\bf x}_\star$.
Result: Prediction $\hat y_\star$.

  1. for $m=1, \dots, M$ do
  2.  |   $\delta_m \overset{\text{def}}{=} \ln\hat\pi_m + \ln \mathcal{N}\left({\bf x}_\star | \hat{\bf\mu}_m, \hat{\bf\Sigma}_m\right)$.
  3. end
  4. Set $\hat y_\star = \arg\max_m \delta_m$
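As a quick usage sketch, the two Python functions above can be chained to carry out these two procedures on the toy data sampled earlier (the test point is arbitrary):

```python
pi_hat, mu_hat, Sigma_hat = fit_gmm_supervised(X, y, M=2)          # learn the GMM (Equation 10.3)
y_hat, posterior = predict_gmm(np.array([1.5, 1.5]), pi_hat, mu_hat, Sigma_hat)  # predict (10.7)
```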

Semi-Supervised Learning of the Gaussian Mixture Model

We want to exploit the information available in the unlabelled data to end up with a better model.

Figure 1 - Semi-supervised learning, in which we have fitted a GMM to data where most of the labels are missing. The unlabelled data has evidently made the problem harder.

$$\hat\theta = \arg\max_\theta \ln p(\underbrace{\lbrace \lbrace {\bf x}_i, y_i \rbrace_{i=1}^{n_l}, \lbrace {\bf x}_i \rbrace_{i=n_l+1}^{n} \rbrace}_{\mathcal{T}} | \theta)$$
$$w_i(m) = \begin{cases} p(y=m | {\bf x}_i, \hat\theta) & \text{if } y_i \text{ is missing}\\ 1 & \text{if } y_i = m\\ 0 & \text{otherwise} \end{cases} \tag{10.10a}$$
$$\begin{align*} \hat\pi_m &= \frac{1}{n} \sum_{i=1}^{n} w_i(m) \tag{10.10b}\\ \hat{\bf\mu}_m &= \frac{1}{\sum_{i=1}^{n} w_i(m)} \sum_{i=1}^{n} w_i(m)\, {\bf x}_i \tag{10.10c}\\ \hat{\bf\Sigma}_m &= \frac{1}{\sum_{i=1}^{n} w_i(m)} \sum_{i=1}^{n} w_i(m) ({\bf x}_i - \hat{\bf\mu}_m)({\bf x}_i - \hat{\bf\mu}_m)^T \tag{10.10d} \end{align*}$$

Data: Partially labelled training data $\mathcal{T} = \lbrace\lbrace {\bf x}_i, y_i \rbrace_{i=1}^{n_l}, \lbrace {\bf x}_i \rbrace_{i=n_l+1}^{n}\rbrace$ with output classes $m=1,\dots,M$.
Result: Gaussian Mixture Model.

  1. Compute $\hat\theta = \lbrace \hat\pi_m, \hat{\bf\mu}_m, \hat{\bf\Sigma}_m \rbrace_{m=1}^M$ according to Equation 10.3 using only the labelled data
  2. repeat
  3.  |   For each ${\bf x}_i$ in the unlabelled subset, compute the prediction $p(y|{\bf x}_i, \hat\theta)$ according to Equation 10.5 using the current parameter estimates
  4.  |   Update the parameter estimates $\hat\theta = \lbrace \hat\pi_m, \hat{\bf\mu}_m, \hat{\bf\Sigma}_m \rbrace_{m=1}^M$ according to Equation 10.10
  5. until convergence
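A sketch of this procedure, reusing `fit_gmm_supervised` and scipy from the earlier examples. Encoding a missing label as `-1` is an assumption of this sketch, not something from the book:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_semisupervised(X, y, M, n_iter=100):
    """Semi-supervised GMM fit: weights (10.10a) and updates (10.10b)-(10.10d).

    X: (n, d) inputs; y: (n,) labels in {0, ..., M-1}, with y[i] = -1 marking a missing label.
    """
    n, d = X.shape
    labelled = y >= 0
    # Step 1: initialise from the labelled subset only (Equation 10.3)
    pi_hat, mu_hat, Sigma_hat = fit_gmm_supervised(X[labelled], y[labelled], M)
    for _ in range(n_iter):
        # Weights w_i(m), Equation (10.10a): posterior for unlabelled points ...
        W = np.zeros((n, M))
        for m in range(M):
            W[:, m] = pi_hat[m] * multivariate_normal.pdf(X, mu_hat[m], Sigma_hat[m])
        W /= W.sum(axis=1, keepdims=True)        # p(y = m | x_i, theta_hat), Equation (10.5)
        W[labelled] = np.eye(M)[y[labelled]]     # ... and one-hot for observed labels
        # Updates (10.10b)-(10.10d)
        Nm = W.sum(axis=0)
        pi_hat = Nm / n
        mu_hat = (W.T @ X) / Nm[:, None]
        for m in range(M):
            diff = X - mu_hat[m]
            Sigma_hat[m] = (W[:, m, None] * diff).T @ diff / Nm[m]
    return pi_hat, mu_hat, Sigma_hat
```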

Cluster Analysis


Unsupervised Learning of the Gaussian Mixture Model

$$p({\bf x}, y) = p({\bf x} | y)\, p(y) = \mathcal{N}\left({\bf x} | {\bf\mu}_y, {\bf\Sigma}_y\right) \pi_y \tag{10.11}$$

Data: Unlabelled training data $\mathcal{T} = \lbrace {\bf x}_i \rbrace_{i=1}^N$, number of clusters $M$.
Result: Gaussian Mixture Model.

  1. Initialise $\hat\theta = \lbrace \hat\pi_m, \hat{\bf\mu}_m, \hat{\bf\Sigma}_m \rbrace_{m=1}^M$
  2. repeat
  3.  |   For each ${\bf x}_i$ in $\lbrace {\bf x}_i \rbrace_{i=1}^N$, compute the prediction $p(y|{\bf x}_i, \hat\theta)$ according to Equation 10.5 using the current parameter estimates $\hat\theta$; these predictions are the weights $w_i(m) = p(y=m|{\bf x}_i, \hat\theta)$ used below
  4.  |   Update the parameter estimates $\hat\theta \leftarrow \lbrace \hat\pi_m, \hat{\bf\mu}_m, \hat{\bf\Sigma}_m \rbrace_{m=1}^M$ according to Equation 10.16
  5. until convergence
$$\begin{align*} \hat\pi_m &= \frac{1}{n} \sum_{i=1}^{n} w_i(m) \tag{10.16a}\\ \hat{\bf\mu}_m &= \frac{1}{\sum_{i=1}^{n} w_i(m)} \sum_{i=1}^{n} w_i(m)\, {\bf x}_i \tag{10.16b}\\ \hat{\bf\Sigma}_m &= \frac{1}{\sum_{i=1}^{n} w_i(m)} \sum_{i=1}^{n} w_i(m) ({\bf x}_i - \hat{\bf\mu}_m)({\bf x}_i - \hat{\bf\mu}_m)^T \tag{10.16c} \end{align*}$$
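A self-contained sketch of this unsupervised EM loop (weights from Equation 10.5, updates from Equation 10.16). The random initialisation is just one assumed choice; in practice other schemes are common:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_unsupervised(X, M, n_iter=100, seed=0):
    """Unsupervised GMM clustering of the (n, d) array X into M clusters."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Step 1: initialise theta_hat = {pi_m, mu_m, Sigma_m} (assumed scheme: random data points as means)
    pi_hat = np.full(M, 1.0 / M)
    mu_hat = X[rng.choice(n, size=M, replace=False)]
    Sigma_hat = np.array([np.cov(X, rowvar=False) for _ in range(M)])
    for _ in range(n_iter):
        # Step 3: w_i(m) = p(y = m | x_i, theta_hat), Equation (10.5)
        W = np.column_stack([
            pi_hat[m] * multivariate_normal.pdf(X, mu_hat[m], Sigma_hat[m]) for m in range(M)
        ])
        W /= W.sum(axis=1, keepdims=True)
        # Step 4: parameter updates, Equations (10.16a)-(10.16c)
        Nm = W.sum(axis=0)
        pi_hat = Nm / n
        mu_hat = (W.T @ X) / Nm[:, None]
        for m in range(M):
            diff = X - mu_hat[m]
            Sigma_hat[m] = (W[:, m, None] * diff).T @ diff / Nm[m]
    return pi_hat, mu_hat, Sigma_hat
```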

k-Means Clustering