Lindholm et al, Chapter 7
Bagging and Boosting
- Introduces a new way of constructing models: combine multiple instances of some basic model into an "ensemble of base models".
- The core idea is the "wisdom of crowds": by training each base model in a slightly different way, each one can contribute something to learning the input-output relationship.
- To obtain a prediction, take a (weighted) average or majority vote of the base models' predictions.
- If constructed carefully, the ensemble's predictions are better than the prediction of a single base model.
Bagging
- The more flexible a model is (i.e., the more capable it is of representing complex input-output relationships), the lower its bias will be.
- The downside is the risk of overfitting, i.e., high model variance.
- Using Bootstrap Aggregating (bagging) can reduce the variance of the base model without increasing its bias.
- Because of the noise in the training data, we may think of the prediction $\hat{y}(\mathbf{x}_\star)$ as a random variable.
- In bagging, each base model is trained on a different 'version' of the training data, and the average of multiple identically distributed random variables has lower variance than any single one of them.
- That is, by averaging the base models' predictions, we obtain a prediction with lower variance than that of a single base model (e.g., a single tree).
Bootstrap
The technique used to (artificially) create the different 'versions' of the original training dataset.
- The bootstrapping algorithm (Algorithm 7.1) is as follows:

Data: Training dataset $\mathcal{T} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$
Result: Bootstrapped data $\tilde{\mathcal{T}} = \{(\tilde{\mathbf{x}}_i, \tilde{y}_i)\}_{i=1}^{n}$
- For $i = 1, \dots, n$ do
  - | Sample $\ell$ uniformly on the set of integers $\{1, \dots, n\}$
  - | Set $\tilde{\mathbf{x}}_i = \mathbf{x}_\ell$ and $\tilde{y}_i = y_\ell$
- end
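As a concrete illustration, here is a minimal Python/NumPy sketch of the bootstrapping step (the function name `bootstrap` and the toy data are my own, not from the book):

```python
import numpy as np

def bootstrap(X, y, rng):
    """Return one bootstrapped 'version' of (X, y): n rows sampled with replacement."""
    n = len(y)
    idx = rng.integers(0, n, size=n)   # sample n indices uniformly on {0, ..., n-1}
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.arange(10, dtype=float)
X_boot, y_boot = bootstrap(X, y, rng)   # on average ~63% of the original rows appear
```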
- The bootstrap is traditionally used for quantifying uncertainties in statistical estimators.
- This is done by creating subsamples of the original dataset, obtained by sampling with replacement.
- By running the bootstrapping algorithm $B$ times, we obtain $B$ random but identically distributed bootstrapped datasets $\tilde{\mathcal{T}}^{(1)}, \dots, \tilde{\mathcal{T}}^{(B)}$.
- The bootstrapped datasets can then be used to train an ensemble of $B$ base models.
- The bagged prediction is obtained by averaging: $\hat{y}_{\text{bag}}(\mathbf{x}_\star) = \frac{1}{B}\sum_{b=1}^{B} \hat{y}^{(b)}(\mathbf{x}_\star)$ (7.1a) refers to regression predictions, and $g_{\text{bag}}(\mathbf{x}_\star) = \frac{1}{B}\sum_{b=1}^{B} g^{(b)}(\mathbf{x}_\star)$ (7.1b) refers to classification predictions (assuming the base classifiers output class probabilities $g^{(b)}$).
- If the base classifiers only output hard class predictions rather than probabilities, we can instead take a majority vote amongst the base classifiers.

Variance Reduction by Averaging
- Make the observation that averaging random variables reduces their variance.
- Consider a collection of identically distributed (but possibly correlated) random variables $z_b$, $b = 1, \dots, B$.
- Let the mean and variance of each variable be $\mathbb{E}[z_b] = \mu$ and $\operatorname{Var}[z_b] = \sigma^2$.
- Let the average correlation between pairs of variables be $\rho$.
- The mean and variance of the average of these variables (e.g., the ensemble output) are given by Equations 7.2a and 7.2b:
  $\mathbb{E}\Big[\tfrac{1}{B}\sum_{b=1}^{B} z_b\Big] = \mu$ (7.2a)
  $\operatorname{Var}\Big[\tfrac{1}{B}\sum_{b=1}^{B} z_b\Big] = \tfrac{1-\rho}{B}\sigma^2 + \rho\sigma^2$ (7.2b)
- We can see that:
  - The mean is unaffected by averaging.
  - The variance decreases as $B$ increases.
  - The less correlated the variables are, the smaller the variance.
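A quick simulation (my own illustration, assuming equicorrelated Gaussian variables) can be used to check Equation 7.2b numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
B, rho, sigma = 10, 0.3, 2.0

# Construct B equicorrelated variables via a shared component:
# z_b = sigma * (sqrt(rho)*c + sqrt(1-rho)*e_b)  =>  Var[z_b] = sigma^2, Corr[z_b, z_b'] = rho
n_sim = 200_000
common = rng.standard_normal((n_sim, 1))
individual = rng.standard_normal((n_sim, B))
z = sigma * (np.sqrt(rho) * common + np.sqrt(1 - rho) * individual)

empirical = z.mean(axis=1).var()                          # variance of the average
theoretical = (1 - rho) / B * sigma**2 + rho * sigma**2   # Equation 7.2b
print(empirical, theoretical)                             # the two should agree closely
```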
The training of the base models is given in the pseudocode below (Method 7.1).

Data: Training dataset $\mathcal{T} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$
Result: $B$ base models
- For $b = 1, \dots, B$ do
  - | Run Algorithm 7.1 to obtain a bootstrapped training dataset $\tilde{\mathcal{T}}^{(b)}$
  - | Learn a base model from $\tilde{\mathcal{T}}^{(b)}$
- end

The prediction using the base models is given as:

Data: $B$ base models and a test input $\mathbf{x}_\star$
Result: Prediction $\hat{y}_{\text{bag}}(\mathbf{x}_\star)$
- For $b = 1, \dots, B$ do
  - | Use base model $b$ to predict $\hat{y}^{(b)}(\mathbf{x}_\star)$ (regression) or $g^{(b)}(\mathbf{x}_\star)$ (class probabilities)
- end
- Obtain $\hat{y}_{\text{bag}}(\mathbf{x}_\star)$ or $g_{\text{bag}}(\mathbf{x}_\star)$ by averaging as in Equation 7.1.
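A hedged Python sketch of Method 7.1 for regression, using scikit-learn decision trees as the (assumed) base model; the helper names are my own:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_trees(X, y, B=100, rng=None):
    """Learn B regression trees, each on its own bootstrapped dataset (Method 7.1)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(y)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # Algorithm 7.1: sample with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def predict_bagged(models, X_star):
    """Average the base-model predictions (the regression case of Equation 7.1)."""
    return np.mean([m.predict(X_star) for m in models], axis=0)
```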
- Despite the total number of parameters in the model increasing as $B$ increases, bagging does not overfit as $B \to \infty$; this is expected and is the intended behaviour.
- It is important to understand that adding more ensemble members does not make the resulting model more flexible - it only reduces the variance.
- Even if each ensemble member overfits its individual bootstrapped version of the training data, the averaging across all ensemble members results in a larger training error but better generalisation.
Out-of-Bag Error Estimation
- When using bagging or random forests, we can estimate the expected new-data error $E_{\text{new}}$ without using cross-validation.
- When bootstrapping, only about 63% of the original training data points are present in any given bootstrapped training dataset.
- This means that, for a given training data point, approximately 37% of the $B$ ensemble members (roughly $B/3$) have not seen that point during training - these members are "out-of-bag" for that data point.
- The data point can therefore act as a test point, since it has not been seen by any of these ensemble members.
- By computing the prediction error of the out-of-bag ensemble members on this point, we obtain an estimate of $E_{\text{new}}$ for this (smaller) ensemble, denoted $E_{\text{OOB}}$.
- A single point gives a fairly poor estimate of $E_{\text{new}}$, but averaging over all training data points gives a much better one.
- If $B$ is large enough that ensembles with roughly $0.37B$ and $B$ members perform similarly, then $E_{\text{OOB}}$ will be a good estimate of $E_{\text{new}}$ (at least as good as the estimate obtained from cross-validation).
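A sketch of the out-of-bag estimate for bagged regression trees (a squared-error version; the bookkeeping and names are my own):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_with_oob(X, y, B=100, rng=None):
    """Train bagged regression trees and estimate the new-data error via E_OOB."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(y)
    models, in_bag = [], []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
        in_bag.append(np.bincount(idx, minlength=n) > 0)   # which points this member saw

    sq_errors = []
    for i in range(n):
        # members that never saw point i are "out-of-bag" for it (roughly 0.37*B of them)
        oob = [m for m, seen in zip(models, in_bag) if not seen[i]]
        if oob:
            pred = np.mean([m.predict(X[i:i + 1])[0] for m in oob])
            sq_errors.append((y[i] - pred) ** 2)
    return models, float(np.mean(sq_errors))   # second value plays the role of E_OOB
```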
Random Forests
- Bagging reduces variance by averaging over an ensemble of models.
- The variance reduction is limited by the correlation between the individual ensemble members.
- However, we can reduce this correlation beyond what is possible with bagging alone, which results in the random forest technique.
- Whilst bagging is a general technique for any base model, random forests assume that base models are regression or classification trees.
- The idea is to inject additional randomness into the tree construction in order to reduce the correlation amongst the base models.
- Let $\tilde{\mathcal{T}}^{(b)}$ be one of the bootstrapped datasets used in bagging.
- To train a classification or regression tree on $\tilde{\mathcal{T}}^{(b)}$, perform the training as usual, with one key difference:
  - Whenever we are about to split a node, we do not consider all $p$ input variables; instead, we choose a random subset of $q \le p$ inputs and only consider these $q$ variables as possible splitting variables.
- This will cause the trees to be less correlated, and can therefore result in a larger variance reduction compared to bagging.
- It is important to note that this perturbation (the random selection of features to split on) increases the variance of each individual ensemble member.
- To understand why it can be a good idea to only consider a subset of inputs as splitting variables, recall that tree-building is based on recursive binary splitting, which is a greedy algorithm.
- If we construct an ensemble of trees using plain bagging, it is very likely that all trees will split on the same dominant input variable(s), making them highly correlated.
- With a random forest, some ensemble members will not be allowed to split on these inputs and are forced to split on other inputs instead.
- This can prove useful further down in the splitting process, which may result in better average performance.
- Since random forest is a bagging method, techniques such as out-of-bag error estimation can still be used.
- As with bagging, taking $B \to \infty$ does not lead to overfitting - the only reason to choose a small value of $B$ is to reduce the computational cost.
- Moreover, since all trees are identically distributed and trained independently, it is possible to parallelise the implementation of random forest learning, which mitigates this cost.
- The choice of $q$ is a hyperparameter; if $q = p$, we recover the bagging method described previously.
- A common rule of thumb is $q = \sqrt{p}$ for classification and $q = p/3$ for regression.
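If scikit-learn is used, the per-split feature subsampling corresponds to its `max_features` parameter; a brief sketch (the dataset is synthetic and purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=0)

# max_features plays the role of q, the number of inputs considered at each split:
#   max_features="sqrt" -> q ~ sqrt(p)   (the classification rule of thumb)
#   max_features=None   -> q = p, which recovers plain bagging of trees
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                oob_score=True, random_state=0).fit(X, y)
print(forest.oob_score_)   # out-of-bag accuracy estimate
```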
Boosting and AdaBoost
- Boosting is another ensemble method used for reducing bias in high-bias base models like classification trees of depth 1.
- Built on the idea that even a weak, high-bias model can capture some of the relationship between input and output.
- By training multiple weak models, each of which describes some part of the input-output relationship, it may be possible to combine their predictions into an overall better prediction.
- Reduce the bias by turning an ensemble of weak models into one strong model.
- In bagging, we train $B$ identically distributed models in parallel.
- Boosting instead trains the models sequentially, in such a way that each model tries to correct the mistakes made by the previous ones.
- The training dataset is modified at each iteration to put more emphasis on the data points that the ensemble built so far has performed poorly on.
- The final prediction is obtained by taking a weighted average or weighted majority vote of the models, where each model's weight reflects how well it performed during training.
AdaBoost
The first successful implementation of a boosting algorithm.
- The general idea is to construct a sequence of $B$ weak binary classifiers $\hat{y}^{(1)}(\mathbf{x}), \dots, \hat{y}^{(B)}(\mathbf{x})$.
- The procedure only considers the final 'hard' predictions from the base models, not predicted class probabilities.
- The ensemble members are not treated equally: each is assigned a positive coefficient $\alpha^{(b)}$, and the boosted classifier is constructed as a weighted majority vote,
  $\hat{y}^{(B)}_{\text{boost}}(\mathbf{x}) = \operatorname{sign}\Big(\sum_{b=1}^{B} \alpha^{(b)} \hat{y}^{(b)}(\mathbf{x})\Big).$
- Each ensemble member votes either $-1$ or $+1$, and the output of the boosted classifier is $+1$ if the weighted sum of the individual votes is positive and $-1$ if it is negative.
- The coefficient $\alpha^{(b)}$ can be thought of as a degree of confidence in the predictions made by the $b$th ensemble member.
- In AdaBoost, the ensemble members $\hat{y}^{(b)}$ and their coefficients $\alpha^{(b)}$ are trained greedily, by minimising the exponential loss of the boosted classifier at each iteration.
- The exponential loss function was given as Equation 5.15 and is repeated below, where $y \cdot f(\mathbf{x})$ is the margin of the classifier:
  $L\big(y \cdot f(\mathbf{x})\big) = \exp\big(-y \cdot f(\mathbf{x})\big).$
- The ensemble members are added one at a time, and when member $b$ is added, this is done so as to minimise the exponential loss of the entire ensemble constructed so far.
- The exponential loss function is chosen because it results in convenient closed-form expressions.
- The boosted classifier after $B$ iterations is $\hat{y}^{(B)}_{\text{boost}}(\mathbf{x}) = \operatorname{sign}\big(f^{(B)}(\mathbf{x})\big)$, where $f^{(B)}(\mathbf{x}) = \sum_{b=1}^{B} \alpha^{(b)} \hat{y}^{(b)}(\mathbf{x})$; $f^{(B)}$ can be expressed iteratively as $f^{(b)}(\mathbf{x}) = f^{(b-1)}(\mathbf{x}) + \alpha^{(b)} \hat{y}^{(b)}(\mathbf{x})$.
- The ensemble members are constructed sequentially, which means that at iteration $b$ the function $f^{(b-1)}$ is already fixed.
- Therefore, in iteration $b$ we must learn the ensemble member $\hat{y}^{(b)}$ and its coefficient $\alpha^{(b)}$ by minimising the exponential loss over the training data,
  $\big(\hat{y}^{(b)}, \alpha^{(b)}\big) = \arg\min_{\hat{y},\, \alpha} \sum_{i=1}^{n} \exp\!\Big(-y_i\big[f^{(b-1)}(\mathbf{x}_i) + \alpha\, \hat{y}(\mathbf{x}_i)\big]\Big).$
- In the equation above:
  - We use the definition of the exponential loss and the sequential structure of the boosted classifier.
  - We use the property $\exp(a + b) = \exp(a)\exp(b)$ to define the quantities
    $w_i^{(b)} = \exp\!\big(-y_i f^{(b-1)}(\mathbf{x}_i)\big),$
    which are interpreted as individual weights of the data points in the training dataset, as per Equation 7.8.
- Note that these weights are independent of $\hat{y}^{(b)}$ and $\alpha^{(b)}$ - that is, when learning $\hat{y}^{(b)}$ and its coefficient $\alpha^{(b)}$ (by solving Equation 7.7c), we can regard the weights as constants.
- We finally arrive at the following optimisation problem (Equation 7.11):
  $\hat{y}^{(b)} = \arg\min_{\hat{y}} \sum_{i=1}^{n} w_i^{(b)}\, \mathbb{I}\{y_i \neq \hat{y}(\mathbf{x}_i)\}.$
- That is, the $b$th ensemble member should be trained by minimising the weighted misclassification loss, where each data point $(\mathbf{x}_i, y_i)$ is assigned a weight $w_i^{(b)}$.
- The weights force the model to focus on the mistakes (previous misclassifications) made by the ensemble of the first $b-1$ classifiers.
In practice, the way that this optimisation problem is solved depends on the choice of base classifier.
- Solving Equation 7.11 is almost identical to solving a standard classification problem; the only difference is that the training data points are weighted.
- Training an ensemble member on a weighted classification problem is straightforward for most base classifiers.
- Since most classifiers are trained by minimising some cost function, the problem reduces to weighting the individual terms of that cost function and solving the modified problem.
- That is, we use a weighted version of the loss function.
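For example, with a scikit-learn base classifier the weighting typically enters through the `sample_weight` argument (the toy data and weights below are purely illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
w = np.array([0.1, 0.4, 0.1, 0.4])   # stands in for the AdaBoost weights w_i^(b)

# The splitting criterion is computed with these weights, which is a practical way of
# (approximately) solving the weighted classification problem of Equation 7.11.
stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
```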
The pseudocode for training an AdaBoost classifier is given below.

Data: Training data $\mathcal{T} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{-1, +1\}$
Result: $B$ weak classifiers and their coefficients $\{\hat{y}^{(b)}, \alpha^{(b)}\}_{b=1}^{B}$
1. Assign weights $w_i^{(1)} = 1/n$ to all data points.
2. for $b = 1, \dots, B$ do
3. | Train a weak classifier $\hat{y}^{(b)}(\mathbf{x})$ on the weighted training data, denoted $\{(\mathbf{x}_i, y_i, w_i^{(b)})\}_{i=1}^{n}$.
4. | Compute $E^{(b)} = \sum_{i=1}^{n} w_i^{(b)}\, \mathbb{I}\{y_i \neq \hat{y}^{(b)}(\mathbf{x}_i)\}$.
5. | Compute $\alpha^{(b)} = \tfrac{1}{2} \ln\big((1 - E^{(b)}) / E^{(b)}\big)$.
6. | Compute $w_i^{(b+1)} = w_i^{(b)} \exp\big(-\alpha^{(b)} y_i \hat{y}^{(b)}(\mathbf{x}_i)\big)$ for $i = 1, \dots, n$.
7. | Set $w_i^{(b+1)} \leftarrow w_i^{(b+1)} / \sum_{j=1}^{n} w_j^{(b+1)}$ (normalisation of weights).
8. end
From lines 5 and 6 of the AdaBoost algorithm, we can draw the following conclusions:
- AdaBoost trains the ensemble by minimising the exponential loss of the boosted classifier at each iteration (the loss function of Equation 7.5, i.e., the exponential loss applied to the boosted classifier).
- The fact that this is an exponential function makes the maths work out nicely, giving closed-form expressions for $\alpha^{(b)}$ and the weight updates.
- Part of the derivation shows that the optimisation can be carried out using the weighted misclassification loss at each iteration (line 4).
The AdaBoost prediction algorithm is as follows:

Data: $B$ weak classifiers with coefficients $\{\hat{y}^{(b)}, \alpha^{(b)}\}_{b=1}^{B}$ and a test input $\mathbf{x}_\star$
Result: Prediction $\hat{y}^{(B)}_{\text{boost}}(\mathbf{x}_\star)$
- Output $\hat{y}^{(B)}_{\text{boost}}(\mathbf{x}_\star) = \operatorname{sign}\Big(\sum_{b=1}^{B} \alpha^{(b)} \hat{y}^{(b)}(\mathbf{x}_\star)\Big)$
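Putting the two algorithms together, here is a compact NumPy/scikit-learn sketch of AdaBoost with decision stumps as the weak classifiers; it assumes labels in {-1, +1}, and the names are my own:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, B=50):
    """AdaBoost training loop (y must take values in {-1, +1})."""
    n = len(y)
    w = np.full(n, 1.0 / n)                                        # line 1: uniform initial weights
    models, alphas = [], []
    for _ in range(B):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)  # line 3
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)   # line 4: weighted error E^(b)
        alpha = 0.5 * np.log((1 - err) / err)                      # line 5: coefficient alpha^(b)
        w = w * np.exp(-alpha * y * pred)                          # line 6: up-weight the mistakes
        w = w / w.sum()                                            # line 7: normalise the weights
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X_star):
    """Weighted majority vote: the sign of the weighted sum of the members' votes."""
    votes = sum(a * m.predict(X_star) for a, m in zip(models, alphas))
    return np.sign(votes)
```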
Gradient Boosting
- AdaBoost performs well if there is little noise in the data.
- Its performance deteriorates if there is a lot of noise in the data, because the exponential loss puts very large weight on misclassified (possibly mislabelled) points.
- We can instead use a more robust loss function, which comes at the cost of extra computation, since the convenient closed-form expressions are lost.
- Instead of saying that we learn a sequence of weak classifiers, where each classifier tries to correct the mistakes made by the previous ones, we can say that we construct an "additive model" of the form
  $f^{(B)}(\mathbf{x}) = \sum_{b=1}^{B} \alpha^{(b)} f^{(b)}(\mathbf{x})$ (Equation 7.15).
- In this equation, the $\alpha^{(b)}$ are real-valued coefficients and the $f^{(b)}(\mathbf{x})$ are basis functions.
- For regression, $f^{(B)}(\mathbf{x})$ can be used directly as the model's prediction.
- For classification, it can be thresholded by a sign function or transformed into class probabilities using a logistic function.
- If the basis functions are fixed, then we only need to learn the coefficients $\alpha^{(b)}$.
- However, we can obtain a more flexible model by also allowing the basis functions themselves to be learned, just as in a neural network.
- In fact, a two-layer regression neural network is an additive model of this form.
- There are two properties that distinguish boosting from other additive models:
  - The basis functions are learned from data: each basis function is itself a machine learning model, the base model of the boosting procedure.
  - The basis functions and coefficients are learned sequentially: each iteration adds one component to the sum in Equation 7.15, and after $B$ iterations the algorithm terminates.
- The goal when training an additive model is to select the coefficients $\alpha^{(b)}$ and basis functions $f^{(b)}$ such that the final model minimises a cost function of the form
  $\frac{1}{n}\sum_{i=1}^{n} L\!\Big(y_i,\; \sum_{b=1}^{B} \alpha^{(b)} f^{(b)}(\mathbf{x}_i)\Big),$
  where $L$ is a suitable loss function.
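To connect this to gradient boosting in code: with squared-error loss, the negative gradient of the cost with respect to the current model's predictions is simply the residual, so each new basis function can be fitted to the residuals. A minimal sketch for regression (the shrinkage factor and tree depth are my own choices, not from the book):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_regression(X, y, B=100, learning_rate=0.1):
    """Fit an additive model sum_b learning_rate * f_b(x) sequentially, one term per iteration."""
    f = np.zeros(len(y))                 # current additive model, evaluated on the training data
    trees = []
    for _ in range(B):
        residual = y - f                 # negative gradient of the squared-error loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        f = f + learning_rate * tree.predict(X)   # add one component to the sum (cf. Equation 7.15)
        trees.append(tree)
    return trees

def predict_boosted(trees, X_star, learning_rate=0.1):
    return learning_rate * sum(t.predict(X_star) for t in trees)
```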