Lindholm et al., Chapter 7

Bagging and Boosting

Bagging


Bootstrap

The technique used to (artificially) create the different 'versions' of the training dataset on which the base models are trained.

  1. For $i=1,\cdots,n$ do
  2.  |   Sample $\ell$ uniformly on the set of integers $\lbrace 1,\cdots,n \rbrace$
  3.  |   Set $\tilde{\bf{x}}_i={\bf{x}}_\ell$ and $\tilde{y}_i=y_\ell$
  4. end
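
A minimal NumPy sketch of this bootstrap step (the 0-based indices play the role of $\ell$; the function name and interface are illustrative, not from the book):

```python
import numpy as np

def bootstrap(X, y, rng=None):
    """Return a bootstrapped copy of (X, y): n rows drawn uniformly
    with replacement from the n original data points (Algorithm 7.1)."""
    rng = np.random.default_rng(rng)
    n = len(y)
    ell = rng.integers(0, n, size=n)  # 0-based version of sampling ell from {1, ..., n}
    return X[ell], y[ell]
```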

Variance Reduction by Averaging

The bagging prediction averages the $B$ bootstrapped base models (for regression or classification, respectively):

$$\hat{y}_\text{bag}({\bf{x}_\star})=\frac{1}{B}\sum_{b=1}^{B} \tilde{y}^{(b)}({\bf{x}_\star}) \qquad \text{or} \qquad {\bf{g}}_\text{bag}({\bf{x}_\star})=\frac{1}{B}\sum_{b=1}^{B}\tilde{{\bf{g}}}^{(b)}({\bf{x}_\star}) \tag{7.1}$$

The point of averaging is variance reduction: if $z_1,\cdots,z_B$ are identically distributed random variables with mean $\mu$, variance $\sigma^2$ and average pairwise correlation $\rho$, then

$$\mathbb{E}\left[\frac{1}{B}\sum_{b=1}^{B} z_b\right]=\mu\tag{7.2a}$$

$$\text{Var}\left[\frac{1}{B} \sum_{b=1}^{B} z_b\right]=\frac{1-\rho}{B} \sigma^2 + \rho\sigma^2\tag{7.2b}$$

so the mean (and hence the bias) is unchanged, while the variance shrinks with $B$ as long as the base models are not perfectly correlated ($\rho<1$).
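
As a quick sanity check on (7.2b), the snippet below simulates $B$ equicorrelated variables and compares the empirical variance of their average with the formula; all constants are made up for illustration:

```python
import numpy as np

# Empirical check of (7.2b): for B identically distributed variables with
# variance sigma^2 and pairwise correlation rho, the variance of their
# average should be (1 - rho)/B * sigma^2 + rho * sigma^2.
B, sigma2, rho, n_trials = 10, 4.0, 0.3, 200_000
cov = sigma2 * ((1 - rho) * np.eye(B) + rho * np.ones((B, B)))  # equicorrelated covariance
rng = np.random.default_rng(0)
z = rng.multivariate_normal(np.zeros(B), cov, size=n_trials)    # each row: z_1, ..., z_B
print(z.mean(axis=1).var())                       # empirical Var of the average
print((1 - rho) / B * sigma2 + rho * sigma2)      # right-hand side of (7.2b)
```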

The training of the base models is given in the pseudocode of Method 7.1 below:

Data: Training set $\mathcal{T} = \lbrace ({\bf x}_i, y_i)\rbrace_{i=1}^{n}$
Result: $B$ base models

  1. For $b=1,\dots,B$ do
  2.  |   Run Algorithm 7.1 to obtain a bootstrapped training dataset $\widetilde{\mathcal{T}}^{(b)}$
  3.  |   Learn a base model from $\widetilde{\mathcal{T}}^{(b)}$
  4. end
  5. Obtain $\hat y_{\text{bag}}$ or ${\bf g}_\text{bag}$ by averaging according to Equation (7.1).
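
A sketch of this training loop, assuming scikit-learn regression trees as the base models (function names and defaults are illustrative, not prescribed by the book):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def train_bagging(X, y, B=100, seed=0):
    """Learn B base models, each on its own bootstrapped dataset
    (the training loop of Method 7.1)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(B):
        ell = rng.integers(0, len(y), size=len(y))          # Algorithm 7.1: bootstrap indices
        models.append(DecisionTreeRegressor().fit(X[ell], y[ell]))
    return models
```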

The prediction using the base models is given as:
Data: $B$ base models and test input ${\bf x}_\star$
Result: A prediction $\hat y_\text{bag}({\bf x}_\star)$ or ${\bf g}_\text{bag}({\bf x}_\star)$

  1. For $b=1,\dots,B$ do
  2.  |   Use base model $b$ to predict $\widetilde{y}^{(b)}({\bf x}_\star)$ or $\widetilde{{\bf g}}^{(b)}({\bf x}_\star)$
  3. end
  4. Obtain $\hat y_{\text{bag}}$ or ${\bf g}_\text{bag}$ by averaging according to Equation (7.1).
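
The corresponding prediction step, continuing the training sketch above and averaging the base-model outputs as in (7.1):

```python
import numpy as np

def predict_bagging(models, X_star):
    """Average the B base-model predictions, Equation (7.1)."""
    predictions = np.stack([model.predict(X_star) for model in models])  # shape (B, n_star)
    return predictions.mean(axis=0)
```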

Out-of-Bag Error Estimation

Random Forests




Boosting and AdaBoost


AdaBoost

The first successful implementation of the boosting idea.

The boosted classifier combines $B$ weak (binary) classifiers through a weighted majority vote:

$$\hat y_\text{boost}^{(B)}({\bf x})=\text{sign}\left\lbrace \sum_{b=1}^{B} \alpha^{(b)} \hat y^{(b)}({\bf x})\right\rbrace\tag{7.4}$$

AdaBoost is derived from the exponential loss,

$$L(y \cdot f({\bf x})) = \exp(-y \cdot f({\bf x}))\tag{7.5}$$

and builds the classifier sequentially,

$$f^{(b)}({\bf x})=f^{(b-1)}({\bf x})+\alpha^{(b)} \hat y^{(b)}({\bf x})\tag{7.6}$$

choosing each weak classifier and its coefficient to minimise the exponential loss over the training data:

$$
\begin{align*}
(\alpha^{(b)}, \hat y^{(b)}) &= \arg\min_{(\alpha, \hat y)} \sum_{i=1}^{n} L(y_i \cdot f^{(b)}({\bf x}_i)) \tag{7.7a}\\
&= \arg\min_{(\alpha, \hat y)} \sum_{i=1}^{n} \exp \left(-y_i \left(f^{(b-1)}({\bf x}_i) + \alpha \hat y({\bf x}_i)\right)\right) \tag{7.7b}\\
&= \arg\min_{(\alpha, \hat y)} \sum_{i=1}^{n} \underbrace{\exp \left(-y_i f^{(b-1)}({\bf x}_i)\right)}_{=\, w_i^{(b)}} \exp \left(-y_i \alpha \hat y({\bf x}_i)\right) \tag{7.7c}
\end{align*}
$$

In the equations above:

  1. Going from (7.7a) to (7.7b), we use the definition of the exponential loss (7.5) and the sequential structure of the boosted classifier (7.6).
  2. Going from (7.7b) to (7.7c), we use the property $\exp(a+b)=\exp(a)\exp(b)$ to factor out the quantities $w_i^{(b)}$, which are interpreted as the individual weights of the data points in the training dataset, as defined in Equation (7.8):
$$w_i^{(b)} \overset{\text{def}}{=} \exp(-y_i f^{(b-1)} ({\bf x}_i)) \tag{7.8}$$
At this point, they continue the AdaBoost training derivation, which I don't think is particularly relevant here. The end result is that the $b$-th weak classifier is chosen to minimise the weighted misclassification error:
$$\hat y^{(b)}=\arg\min_{\hat y} \sum_{i=1}^{n} w_i^{(b)} \mathbb{I}(y_i \neq \hat y({\bf x}_i)) \tag{7.9}$$

In practice, how this optimisation problem is solved depends on the choice of base classifier.
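
As a side note (my addition, not part of the notes above), the expression for $\alpha^{(b)}$ used in line 5 of the training algorithm below follows from (7.7c) in one step: splitting the sum over correctly and incorrectly classified points,

$$\sum_{i=1}^{n} w_i^{(b)} \exp(-y_i \alpha \hat y({\bf x}_i)) = e^{-\alpha} \sum_{i:\, y_i = \hat y({\bf x}_i)} w_i^{(b)} + e^{\alpha} \sum_{i:\, y_i \neq \hat y({\bf x}_i)} w_i^{(b)},$$

and setting the derivative with respect to $\alpha$ to zero, with normalised weights so that the two sums equal $1-E_\text{train}^{(b)}$ and $E_\text{train}^{(b)}$, gives

$$\alpha^{(b)} = \frac{1}{2}\ln\frac{1-E_\text{train}^{(b)}}{E_\text{train}^{(b)}}.$$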


The AdaBoost training algorithm is given as:
Data: Training data $\mathcal{T}=\lbrace({\bf{x}}_i, y_i)\rbrace_{i=1}^{n}$
Result: $B$ weak classifiers

  1. Assign weights $w_i^{(1)}=\frac{1}{n}$ to all data points
  2. for $b=1,\cdots,B$ do
  3.  |    Train a weak classifier $\hat{y}^{(b)}({\bf{x}})$ on the weighted training data denoted $\lbrace ({\bf{x}}_i, y_i, w_i^{(b)})\rbrace_{i=1}^{n}$
  4.  |    Compute $E_\text{train}^{(b)}=\sum_{i=1}^{n} w_i^{(b)} \mathbb{I}\lbrace y_i\ne\hat{y}^{(b)}({\bf{x}}_i)\rbrace$
  5.  |    Compute $\alpha^{(b)}=0.5\ln((1-E_\text{train}^{(b)})/E_\text{train}^{(b)})$
  6.  |    Compute $w_i^{(b+1)}=w_i^{(b)} \exp (-\alpha^{(b)} y_i \hat{y}^{(b)}({\bf{x}}_i))$, $i=1,\cdots,n$
  7.  |    Set $w_i^{(b+1)}\leftarrow w_i^{(b+1)} / \sum_{j=1}^{n} w_j^{(b+1)}$ (normalisation of weights)
  8. end
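
A minimal sketch of this training loop, assuming decision stumps from scikit-learn as the weak classifiers and labels encoded as $\pm 1$ (names and defaults are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_adaboost(X, y, B=50):
    """AdaBoost training sketch; y is assumed to be a NumPy array of -1/+1 labels."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # line 1: uniform initial weights
    classifiers, alphas = [], []
    for _ in range(B):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)    # line 3: weak classifier on weighted data
        pred = stump.predict(X)
        E_train = np.sum(w * (pred != y))   # line 4: weighted training error
        alpha = 0.5 * np.log((1 - E_train) / E_train)   # line 5: confidence value
        w = w * np.exp(-alpha * y * pred)   # line 6: re-weight the data points
        w = w / w.sum()                     # line 7: normalise the weights
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas
```

In practice one would also guard against $E_\text{train}^{(b)}=0$, which makes $\alpha^{(b)}$ blow up in line 5.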

From lines 5 and 6 of the AdaBoost algorithm, we can draw the following conclusions:

  1. The smaller the weighted training error $E_\text{train}^{(b)}$, the larger the confidence value $\alpha^{(b)}$, so more accurate weak classifiers carry more weight in the final vote (line 5).
  2. The weight $w_i^{(b+1)}$ is increased for data points misclassified by $\hat{y}^{(b)}$ and decreased for correctly classified ones, so the next weak classifier focuses on the points that are currently hard to classify (line 6).


The AdaBoost prediction algorithm is as follows:
Data: $B$ weak classifiers with confidence values $\lbrace\hat{y}^{(b)}({\bf{x}}),\alpha^{(b)}\rbrace_{b=1}^{B}$ and test input ${\bf x}_\star$
Result: Prediction $\hat{y}_\text{boost}^{(B)}({\bf x}_\star)$

  1. Output $\hat{y}_\text{boost}^{(B)}({\bf x}_\star)=\text{sign}\lbrace\sum_{b=1}^{B}\alpha^{(b)}\hat{y}^{(b)}({\bf{x}_\star})\rbrace$
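
A corresponding prediction sketch, continuing the AdaBoost training sketch above and implementing the weighted sign vote of Equation (7.4):

```python
import numpy as np

def predict_adaboost(classifiers, alphas, X_star):
    """Weighted majority vote of the weak classifiers, Equation (7.4)."""
    votes = sum(a * clf.predict(X_star) for a, clf in zip(alphas, classifiers))
    return np.sign(votes)
```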

Gradient Boosting

Gradient boosting also builds an additive model of base models,

$$f^{(B)}({\bf x}) = \sum_{b=1}^{B} \alpha^{(b)} f^{(b)}({\bf x}) \tag{7.14}$$

but chooses each new base model to reduce the training cost

$$J(f({\bf X})) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f({\bf x}_i)) \tag{7.15}$$

by (approximately) following its negative gradient with respect to the model's predictions.
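
A hedged sketch of one common instance of this idea, gradient boosting with squared-error loss, where the negative gradient of (7.15) at the current model is simply the residual $y_i - f({\bf x}_i)$; the constant learning rate loosely plays the role of $\alpha^{(b)}$ in (7.14), and the tree depth and other names are illustrative choices, not from the book:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def train_gradient_boosting(X, y, B=100, learning_rate=0.1):
    """Gradient boosting sketch for squared-error loss (y as a NumPy array)."""
    f0 = y.mean()
    f = np.full(len(y), f0)                  # initial constant model
    models = []
    for _ in range(B):
        residual = y - f                     # negative gradient of 0.5*(y - f)^2
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        f = f + learning_rate * tree.predict(X)
        models.append(tree)
    return f0, models

def predict_gradient_boosting(f0, models, X_star, learning_rate=0.1):
    """Sum the initial constant and the scaled base-model contributions."""
    return f0 + learning_rate * sum(m.predict(X_star) for m in models)
```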