Topics in imbalanced data classification: AdaBoost and Bayesian relevance vector machine
This research comprises three parts addressing classification, with a focus on the imbalanced data problem, one of the most widely studied and essential issues in the field.

The first part studies the Adaptive Boosting (AdaBoost) algorithm. AdaBoost is an effective classification method, but it still needs improvement on imbalanced data. We propose a method that improves AdaBoost by assigning new weighted vote parameters to the weak classifiers. The proposed vote parameters are determined not only by the global error rate but also by the classification accuracy on the positive class, which is our primary interest; the imbalance index of the data is also a factor in constructing our algorithms. Numerical studies show that the proposed algorithms outperform the traditional ones, especially under the F1-measure, and theoretical proofs of their advantages are presented.

The second part treats the Relevance Vector Machine (RVM), a supervised learning algorithm that extends the Support Vector Machine (SVM) with a Bayesian sparsity model. Compared with regression, RVM classification is harder to carry out because the posterior of the weight parameter has no closed-form solution. The original RVM classification algorithm uses Newton's method to locate the mode of the weight posterior and then approximates the posterior by a Gaussian distribution via Laplace's method. This works, but it merely applies frequentist methods within a Bayesian framework. We first propose a Generic Bayesian RVM classification, which is a pure Bayesian model. We conjecture that our algorithm achieves convergent estimates of the quantities of interest, in contrast with the nonconvergent estimates of the original RVM classification algorithm.
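The vote-weighting idea from the first part can be sketched in outline. This is a minimal illustration with decision stumps, not the thesis's algorithm: the positive-class bonus term and its scale `lam` are hypothetical placeholders standing in for the actual weighting formula, which is not reproduced here.

```python
import numpy as np

def _stump_predict(stump, X):
    """Predict {-1, +1} with a one-feature threshold stump (j, t, s)."""
    j, t, s = stump
    return np.where(X[:, j] <= t, -s, s)

def _best_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (-1, 1):
                err = w[np.where(X[:, j] <= t, -s, s) != y].sum()
                if err < best_err:
                    best, best_err = (j, t, s), err
    return best

def adaboost_train(X, y, n_rounds=20, lam=0.5):
    """AdaBoost with decision stumps; y must take values in {-1, +1}.

    Each stump's vote weight combines the usual global-error term with an
    extra term rewarding accuracy on the positive class, scaled by `lam`.
    Both the extra term and `lam` are illustrative placeholders, not the
    thesis's actual weighting formula.
    """
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(n_rounds):
        stump = _best_stump(X, y, w)
        pred = _stump_predict(stump, X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # standard vote weight
        acc_pos = np.clip((pred[y == 1] == 1).mean(), 1e-10, 1 - 1e-10)
        alpha += lam * 0.5 * np.log(acc_pos / (1 - acc_pos))  # positive-class bonus
        w *= np.exp(-alpha * y * pred)  # re-weight: misclassified points gain mass
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    """Sign of the alpha-weighted vote of all stumps."""
    agg = sum(a * _stump_predict(st, X) for a, st in ensemble)
    return np.where(agg >= 0, 1, -1)
```

With `lam=0`, this reduces to standard discrete AdaBoost; a positive `lam` inflates the vote of stumps that classify the minority (positive) class well, which is the general direction the first part pursues.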
Furthermore, we propose a fully Bayesian approach with a hierarchical hyperprior structure for RVM classification, which improves classification performance, especially on imbalanced data.

The third part extends the second. The original RVM classification model builds the likelihood with the logistic link function, which complicates inference because the posterior of the weight parameter has no closed-form solution. We propose using the probit link function instead of the logistic one for the likelihood in RVM classification, yielding PRVM (RVM with the Probit link function). We show that the posterior of the weight parameter in our model follows a multivariate normal distribution and therefore has a closed form. Our algorithm introduces a latent variable that greatly simplifies the Bayesian computation; its conditional posterior follows a truncated normal distribution. Compared with the original RVM classification model, ours is another pure Bayesian approach with a more efficient computation process. For the prior structure, we first adopt an independent Normal-Gamma prior to obtain a Generic Bayesian PRVM algorithm. We then propose a Fully Bayesian PRVM algorithm with a hierarchical hyperprior structure, which further improves classification performance, especially on imbalanced data.
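The latent-variable mechanism behind the probit link can be sketched as a Gibbs sampler. This is a simplified stand-in, not the PRVM itself: it assumes a fixed N(0, tau^2 I) prior on the weights in place of the Normal-Gamma and hierarchical hyperprior structures, and the function names and `tau` are illustrative. It does, however, show the two conditionals the text describes: a truncated-normal draw for the latent variable and an exact multivariate-normal draw for the weights.

```python
import numpy as np

def _trunc_norm(rng, mu, positive):
    """Draw from N(mu, 1) truncated to (0, inf) if `positive`, else (-inf, 0).

    Plain rejection sampling -- adequate for a sketch, though slow when
    mu lies far on the wrong side of zero.
    """
    while True:
        z = rng.normal(mu)
        if (z > 0) == positive:
            return z

def probit_gibbs(Phi, y, n_iter=1000, tau=1.0, seed=0):
    """Gibbs sampler for a probit model via latent-variable augmentation.

    `Phi` is the design (or kernel) matrix and `y` takes values in {0, 1}.
    For simplicity the weights get a fixed N(0, tau**2 I) prior; the PRVM
    instead places (hierarchical) priors on the weight precisions.
    """
    n, d = Phi.shape
    rng = np.random.default_rng(seed)
    # With probit augmentation, w | z is exactly multivariate normal, and
    # its covariance does not depend on z, so factor it once up front.
    V = np.linalg.inv(Phi.T @ Phi + np.eye(d) / tau**2)
    L = np.linalg.cholesky(V)
    w = np.zeros(d)
    samples = np.empty((n_iter, d))
    for it in range(n_iter):
        # z_i | w, y_i: normal around the linear predictor, truncated to
        # the half-line matching the observed label.
        mu = Phi @ w
        z = np.array([_trunc_norm(rng, m, yi == 1) for m, yi in zip(mu, y)])
        # w | z: closed-form multivariate normal draw, N(V Phi' z, V).
        w = V @ (Phi.T @ z) + L @ rng.normal(size=d)
        samples[it] = w
    return samples
```

Because both conditionals are standard distributions, every Gibbs step is an exact draw; this is the computational advantage the probit link buys over the logistic link, where no such closed-form weight posterior exists.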