Caltech
Center for Neuromorphic Systems Engineering

Home
Research
News
People

[back]

Optimization and Generalization in Boosting
Ling Li, Yaser Abu-Mostafa

Abstract. The superior out-of-sample performance of AdaBoost has been attributed to the fact that it minimizes a cost function based on margin. In order to examine how the cost function, in and of itself, affects the out-of-sample performance, we apply several more sophisticated optimization techniques directly to the cost function. When the AdaBoost exponential cost function is optimized, our methods generally yield much lower cost and training error but higher test error, which implies that the exponential cost is vulnerable to overfitting. With the optimization power gained, we can adopt more "regularized" cost functions that have better out-of-sample performance but are difficult to optimize. Our experiments demonstrate that with suitable cost functions, our methods can have better out-of-sample performance.

Motivation and Aims. AdaBoost is probably the most popular algorithm among the boosting family which generates a linear combination of weak hypotheses. It iteratively adds hypotheses generated by a given weak learner to the linear combination, emphasizes difficult examples by giving them higher sample weights, and favors hypotheses with lower training errors by giving them larger coefficients. AdaBoost can be viewed as a special case of AnyBoost, a general gradient descent in a function space.

It has been observed experimentally that AdaBoost keeps improving the out-of-sample error even after the training error of the linear combination has reached zero. One explanation to this is that AdaBoost improves the margins of the training examples even after all the examples have positive margins, and larger margins imply better out-of-sample performance. However, this explanation was challenged by LPBoost which achieves larger minimum margins than AdaBoost, but did not have better out-of-sample performance than AdaBoost, mostly worse. Another related explanation is that AdaBoost optimizes a cost function based on example margins. Although there is a theoretical bound on the out-of-sample error based on cost, it is still unclear whether minimizing the cost is helpful in practice.

Our goal is to take a closer look at this question, examining how the cost function, in and of itself, affects the out-of-sample performance.

Research. Our research on the cost function consists of two related parts: optimization and generalization. The optimization part considers how fast and to what degree the cost can be reduced; The generalization part investigate how the out-of-sample error is affected by the cost function, via its shape and value.

For the optimization part, we design three techniques which generally achieve much lower cost than AdaBoost with the same boosting size. The techniques can also be used for other cost functions thus are beneficial to general boosting algorithms.

1. RotBoost: AdaBoost is an iterative process where previously generated hypotheses are untouched by further learning. To break this limit, we randomly "kick out" hypotheses from the linear combination and append new hypotheses to the combination based on the rest of them.

2. alphaBoost. After the linear combination of weak hypotheses is generated by normal AdaBoost, the cost can be further decreased via optimizing the linear coefficients.

3. CGBoost. Instead of gradient descent in the function space, we apply conjugate gradient to the cost in the function space. The descent direction is generally also a linear combination of weak hypotheses.

We observe that these methods generally achieve much lower cost and training error but higher test error (see Figure 1), which implies that the exponential cost is vulnerable to overfitting.

Figure 1. Performance of alphaBoost and CGBoost (solid curves) vs. AdaBoost (dashed curves). (a) alphaBoost decreases the cost further by optimizing coefficients of the first 150 hypotheses; (b) CGBoost decreases the cost faster by adopting a more sophisticated optimization technique. However, they both have worse test errors.

This observation implies that the shape of the cost function might have big impact on the out-of-sample performance. We thus introduce the bisigmoid cost function which has a flatter part for negative margins. This time, our method yields both better in-sample and out-of-sample errors (see Figure 2).

Figure 2. With bisigmoid cost, CGBoost can achieve both lower cost and test error.

Achievements

We obtain three sets of results:

1. The introduction of new boosting algorithms, namely RotBoost, alphaBoost, and CGBoost, which have better cost optimization performance than AnyBoost.

2. The conclusion that AdaBoost cost function is much more vulnerable to overfitting when it is directly minimized instead of being minimized within the confines of the AdaBoost algorithm.

3. The identification of more "regularized" cost functions whose direct minimization results in a better out-of-sample performance than that of the AdaBoost cost function.

References
CGBoost: Conjugate Gradient in Function Space. L. Li, Y. S. Abu-Mostafa, and A. Pratap. Submitted to NIPS'03.

 

top