In prediction problems, we often fit one model, evaluate its performance, and test it on unseen data. But what if we could combine multiple models at once and leverage their combined performance? This is the spirit of “boosting”: creating an ensemble of learning algorithms, which perform better together than each does independently. Here, we’ll give a quick overview of boosting, and we’ll review one of the most influential boosting algorithms, AdaBoost.
Consider a binary classification problem based on some data $\{(x_i, y_i)\}_{i=1}^n$ where $y_i \in \{0, 1\}$. Suppose you have a few decision rules that you can use to classify your data points, but each of the decision rules is not very accurate (e.g., each barely performs above random guessing). Boosting provides a systematic way to combine these “weak” classifiers into a “strong” classifier that performs much better.
To begin to understand the theory behind boosting, we should first define the difference between “strong learners” and “weak learners”. Intuitively, a strong learner is a learning algorithm that achieves close to perfect prediction. A weak learner is one that achieves prediction performance barely above random chance.
The formal definitions of strong and weak learners are as follows (as described in Prof. Elad Hazan’s book):
\[\text{error}(\mathcal{A}(S_T)) \leq \epsilon.\]A hypothesis class $\mathcal{H}$ is strongly learnable if the following holds. There exists an algorithm $\mathcal{A}$ that accepts $S_T = {(x_t, y_t), t \in [T]}$ and returns hypotheses $\mathcal{A}(S_T) \in \mathcal{H}$ that satisfies: for any $\epsilon, \delta > 0$ there exists a sufficiently large natural number $T = T(\epsilon, \delta)$ such that for any distribution $\mathcal{D}$ over pairs $(x, y)$ and $T$ samples from this distribution, it holds with probability at least $1 - \delta$
In words, a strong learner is one that achieves very small error. This is largely equivalent to the definition of PAC learnability.
On the flip side, the formal definition of a weak learner is as follows.
A hypothesis class $\mathcal{H}$ is $\gamma$-weakly learnable if the following holds. There exists an algorithm $\mathcal{A}$ that accepts data $S_T = {(x_t, y_t), t \in [T]}$ and returns a hypothesis $\mathcal{A}(S_T) \in \mathcal{H}$ that satisfies: for any $\delta > 0$ there exists a sufficiently large natural number $T = T(\delta, \gamma)$ such that for any distribution $\mathcal{D}$ over pairs $(x, y)$ and $T$ samples from this distribution, it holds with probability at least $1 - \delta$
\[\text{error}(\mathcal{A}(S_T)) \leq \frac12 - \gamma.\]
In the late 80s, there was an open theoretical question about whether weak and strong learners were actually distinct. If we could build a classifier for a problem just barely above random guessing (i.e., a weak learner exists), does that mean that we can classify arbitrarily well (i.e., a strong learner exists)? Or were there problems for which a weak learner exists, but not a strong learner? This is the concept of “boosting” as it was first described by Michael Kearns and Leslie Valiant: whether a slight advantage in performance could be “boosted” to make the algorithm arbitarily accurate.
In a 1990 seminal paper in machine learning, Rob Schapire found a surprising conclusion: The existence of a weak learner implies the existence of a strong learner. In other words (since a strong learner trivially implies a weak learner),
\[\text{Weak learnability} \iff \text{Strong learnability}.\]Schapire showed how to transform a set of weak learners into a unified algorithm that could perform arbitrarily well. This was a groundbreaking result because it implied that there is a generic, practical method for improving performance of any black-box algorithm.
Many variations of boosting have since been proposed, with the central difference between them being the way in which the base (weak) learners are combined. One of the most significant algorithms is called AdaBoost (developed by Schapire), which we now review.
AdaBoost is a particular instance of a boosting algorithm that finds an optimal weighting for each base learner. The basic premise of AdaBoost is this: given a set of classifiers $\{\delta_j\}_{j = 1}^m$ and data $\{(x_i, y_i)\}_{i=1}^n$, we iteratively refit each of the classifiers, where on each iteration we place more weight on examples that were misclassified in earlier iterations. By placing more weight on “difficult” examples, we can encourage the learners to focus more of their attention on these examples. Then, we can compute a weighting scheme for the learners themselves, and use these weights to output a final prediction.
A key attribute of AdaBoost is its exponential weight update schedule. In particular, we maintain a set of weights $\{w_i\}_{i=1}^n$ over the examples (we can think of this as a distribution over examples if we normalize it). On each iteration, if the learner incorrectly classifies example $x_i$, we update its weight as
\[w_i^{(t+1)} = w_i^{(t)} e^{\alpha_t \mathbf{I}(\hat{y}_i \neq y_i)}\]where $\alpha_t = \frac{1 - \text{error}}{\text{error}}$ is the log-odds of the error of the current classifier. Notice that the weight only changes if the guess $\hat{y}_i$ is incorrect – otherwise, the weight remains unchanged. Furthermore, the weight increases more when the error is lower,
The full algorithm in Python is below. We show an example with the sklearn breast cancer dataset.
from sklearn.datasets import load_digits, load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
import numpy as np
# ---- Load data -----
data = load_breast_cancer()
X = data.data
y = data.target
n, p = X.shape
# ---- AdaBoost ----
NUM_ITER = 10
weights = np.repeat(1/n, n) # initial weights are uniform across samples
for ii in range(NUM_ITER):
# Fit LR with current weights
clf = DecisionTreeClassifier(random_state=0, max_depth=1).fit(X, y, sample_weight=weights)
pred = clf.predict(X)
err = np.dot(weights, (pred != y)) / np.sum(weights)
# Compute log odds of model error (measure of current model performance)
alph = np.log((1 - err) / err)
# Update weights
weights = weights * np.exp(alph * (pred != y))
# Final classifier
preds = np.array(preds) * 2 - 1
final_preds = (np.sign(np.matmul(preds.T, alphas)) + 1) / 2
print("Final accuracy: {}".format(np.mean(final_preds == y)))