COMP4702 Lecture 8

Bagging and Boosting

Bagging

Borrowing the idea of bootstrapping from statistics

Bootstrapping


  1. For $i=1,\cdots,n$ do
  2.  |   Sample $\ell$ uniformly on the set of integers $\lbrace 1,\cdots,n\rbrace$
  3.  |   Set $\tilde{\bf{x}}_i={\bf{x}}_\ell$ and $\tilde{y}_i=y_\ell$
  4. end
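
A minimal sketch of this resampling step in Python (NumPy assumed; the function name bootstrap_sample and the toy data are illustrative only):

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Steps 1-3 above: draw n indices uniformly with replacement."""
    n = len(X)
    idx = rng.integers(0, n, size=n)  # sample ell uniformly on {0, ..., n-1}, n times
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.arange(10)
X_tilde, y_tilde = bootstrap_sample(X, y, rng)
print(y_tilde)  # some labels repeat, and roughly a third are left out entirely
```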

Bootstrapping Example

Figure 1 Visualisation of Data for Bootstrapping Example

Figure 2 Created 9 bootstrapped trees from the above dataset

Figure 3 Observe that the resulting model (Right)

How does Bagging reduce the variance of models?

Let $z_1,\cdots,z_B$ be identically distributed random variables (the predictions of the $B$ ensemble members) with mean $\mu$, variance $\sigma^2$ and average pairwise correlation $\rho$. Then

$$\mathbb{E}\left[\frac{1}{B}\sum_{b=1}^{B} z_b\right]=\mu\tag{7.2a}$$

$$\text{Var}\left[\frac{1}{B} \sum_{b=1}^{B} z_b\right]=\frac{1-\rho}{B} \sigma^2 + \rho\sigma^2\tag{7.2b}$$

Averaging leaves the expected value unchanged but reduces the variance: the $\frac{1-\rho}{B}\sigma^2$ term shrinks as $B$ grows, while the $\rho\sigma^2$ term only shrinks if the ensemble members are less correlated with one another.
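
Equation (7.2b) can be checked numerically. The sketch below (my own illustration, not from the lecture) simulates $B$ variables with common variance $\sigma^2$ and pairwise correlation $\rho$, and compares the empirical variance of their average with the formula:

```python
import numpy as np

rng = np.random.default_rng(0)
B, rho, sigma, n_trials = 10, 0.3, 2.0, 100_000

# Construct B variables with Var = sigma^2 and pairwise correlation rho:
# z_b = sqrt(rho)*c + sqrt(1-rho)*e_b, with c and e_b independent N(0, sigma^2).
c = rng.normal(0.0, sigma, size=(n_trials, 1))
e = rng.normal(0.0, sigma, size=(n_trials, B))
z = np.sqrt(rho) * c + np.sqrt(1.0 - rho) * e

empirical = z.mean(axis=1).var()
theoretical = (1.0 - rho) / B * sigma**2 + rho * sigma**2  # equation (7.2b)
print(empirical, theoretical)  # the two values should agree closely
```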


Out-of-Bag Error Estimation
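
Each bootstrap sample leaves out roughly a third of the training points (the probability a given point is never drawn is $(1-\frac{1}{n})^n \approx e^{-1}$), and those out-of-bag points act as a built-in validation set. A minimal sketch of how this is used in practice, assuming scikit-learn (the lecture does not prescribe a library):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data, for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True scores each training point using only the trees whose
# bootstrap sample did not contain that point.
clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
clf.fit(X, y)
print("OOB accuracy estimate:", clf.oob_score_)
```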

Random Forests

Random Forest v Bagging Example

Example 7.4 from Lindholm et al.

Figure 4 - Dataset for random forest vs bagging example.
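
A minimal sketch of this kind of comparison on synthetic data (not the dataset from Example 7.4; scikit-learn assumed): bagged trees consider every feature at every split, while the random forest restricts each split to a random subset of features, which decorrelates the trees (a smaller $\rho$ in (7.2b)).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)

# Bagging: B deep trees, each fit to a bootstrap sample, all features at every split
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            random_state=0)
# Random forest: same idea, but each split only considers a random feature subset
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)

for name, model in [("bagging", bagging), ("random forest", forest)]:
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: 5-fold CV accuracy = {acc:.3f}")
```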

Boosting

Example: Bagging vs Boosting

Figure 5 - Decision boundary for a random forest and bagged decision tree classifier

Figure 6 - Compilation of ensemble members for bagging and random forest

Figure 7 - $E_\text{new}$ for the random forest and bagging models.

Example: Boosting Minimal Example

Figure 8 - $E_\text{new}$ for the random forest and bagging models.

AdaBoost

An old boosting algorithm that is still quite useful.

The AdaBoost training algorithm is given as:
Data: Training data $\mathcal{T}=\lbrace{\bf{x}}_i, y_i\rbrace_{i=1}^{n}$
Result: $B$ weak classifiers

  1. Assign weights $w_i^{(1)}=\frac{1}{n}$ to all datapoints
  2. for $b=1,\cdots,B$ do
  3.  |    Train a weak classifier $\hat{y}^{(b)}({\bf{x}})$ on the weighted training data denoted $\lbrace ({\bf{x}}_i, y_i, w_i^{(b)})\rbrace_{i=1}^{n}$
  4.  |    Compute $E_\text{train}^{(b)}=\sum_{i=1}^{n} w_i^{(b)} \mathbb{I}\lbrace y_i\ne\hat{y}^{(b)}({\bf{x}}_i)\rbrace$
  5.  |    Compute $\alpha^{(b)}=0.5\ln((1-E_\text{train}^{(b)})/E_\text{train}^{(b)})$
  6.  |    Compute $w_i^{(b+1)}=w_i^{(b)} \exp(-\alpha^{(b)} y_i \hat{y}^{(b)}({\bf{x}}_i))$ for $i=1,\cdots,n$
  7.  |    Set $w_i^{(b+1)}\leftarrow w_i^{(b+1)} / \sum_{j=1}^{n} w_j^{(b+1)}$ (normalisation of weights)
  8. end
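
A direct translation of this pseudocode into Python, assuming decision stumps from scikit-learn as the weak classifiers and labels coded as $\pm 1$ (both assumptions are mine, not part of the algorithm statement above):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, B=50):
    """AdaBoost as in the pseudocode above; y must be coded as +1/-1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                             # step 1: uniform weights
    classifiers, alphas = [], []
    for b in range(B):                                  # step 2
        stump = DecisionTreeClassifier(max_depth=1)     # weak classifier (assumed)
        stump.fit(X, y, sample_weight=w)                # step 3: weighted training
        pred = stump.predict(X)
        E_train = np.sum(w * (pred != y))               # step 4: weighted error
        E_train = np.clip(E_train, 1e-10, 1 - 1e-10)    # guard against a perfect stump
        alpha = 0.5 * np.log((1.0 - E_train) / E_train) # step 5
        w = w * np.exp(-alpha * y * pred)               # step 6: re-weight points
        w = w / w.sum()                                 # step 7: normalise
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(X, classifiers, alphas):
    """Weighted vote: sign of the alpha-weighted sum of member outputs."""
    votes = sum(a * clf.predict(X) for clf, a in zip(classifiers, alphas))
    return np.sign(votes)
```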

From lines 5 and 6 of the AdaBoost algorithm, we can draw the following conclusions: the smaller the weighted training error $E_\text{train}^{(b)}$ of a weak classifier, the larger its coefficient $\alpha^{(b)}$, so more accurate members get more say in the final vote; and the weight update on line 6 increases the weights of misclassified points while decreasing those of correctly classified points, so the next weak classifier concentrates on the points the current ensemble gets wrong.


Design Choices for AdaBoost

Gradient Boosting

A newer technique than AdaBoost.

$$f^{(B)}({\bf x}) = \sum_{b=1}^{B} \alpha^{(b)} f^{(b)}({\bf x})\tag{7.14}$$

In boosting:

  1. the ensemble members $f^{(b)}$ are trained sequentially, each one fitted to correct the mistakes of the ensemble built so far, and
  2. the coefficients $\alpha^{(b)}$ weight each member's contribution, instead of the plain average $\frac{1}{B}$ used in bagging.

Training via Gradient Boosting

$$J(f({\bf X}))=\frac{1}{n} \sum_{i=1}^{n} L(y_i, f({\bf x}_i)) \tag{7.15}$$
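
As a sketch of how this objective is minimised in practice (my own illustration, assuming squared-error loss for regression, where the negative gradient of $L$ with respect to $f({\bf x}_i)$ is simply the residual $y_i - f({\bf x}_i)$): each new ensemble member is a small tree fitted to those residuals.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_train(X, y, B=100, learning_rate=0.1):
    """Gradient boosting for regression with squared-error loss (minimal sketch)."""
    f0 = np.mean(y)                       # initial constant model
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(B):
        residuals = y - pred              # negative gradient of 0.5*(y - f)^2 w.r.t. f
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, residuals)            # fit the next member to the residuals
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    return f0 + learning_rate * sum(t.predict(X) for t in trees)

# Toy usage on synthetic data (names and data are illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
f0, trees = gradient_boost_train(X, y)
print("training MSE:", np.mean((y - gradient_boost_predict(X, f0, trees)) ** 2))
```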