
Research On Positive And Unlabeled Learning By Random Forest

Posted on: 2015-04-09
Degree: Master
Type: Thesis
Country: China
Candidate: Q Shao
Full Text: PDF
GTID: 2298330434460218
Subject: Computer software and theory

Abstract/Summary:
Training classifiers from positive and unlabeled data is called Positive and Unlabeled Learning (PU Learning). Traditional classification algorithms require fully labeled training samples to train classifiers. In reality, however, data are usually completely unlabeled or labeled only to a small extent. For binary classification problems, positive data, in addition to unlabeled data, can also be acquired easily in many situations. If classifiers can be trained with small amounts of positive data and large amounts of unlabeled data, and their performance is comparable to that of classifiers trained on fully labeled samples, a great deal of time and human labor can be saved.

To address the problem of PU Learning, this thesis studies how to ensemble POSC4.5 decision tree classifiers into a random forest classifier for PU Learning that achieves high classification performance and requires little training time. The main research achievements are as follows:

(1) The POSC4.5 algorithm is extended to support random feature selection. POSC4.5 is a PU Learning decision tree algorithm with the advantages of good classification performance and a foundation in computational learning theory. This thesis uses POSC4.5 as the base algorithm of the random forest and extends it so that, while growing a decision tree, a subset of attributes is randomly selected at each tree node, and the best splitting attribute is then chosen from that subset by the information-gain criterion computed under PU Learning.

(2) Two random forest algorithms for PU Learning are proposed, targeting two different PU Learning scenarios. In the two scenarios, PU training data are generated in different ways. This research analyses the computational-learning-theory foundation of the POSC4.5 algorithm and adopts a different bootstrap sampling method for each scenario.
In the first bootstrap sampling method, the positive sample is first merged into the unlabeled data, and the combined data set is then sampled. In the second, positive and unlabeled data are sampled separately. Based on these two bootstrap sampling methods, two random forest algorithms for PU Learning are developed.

(3) An out-of-bag error computed from PU Learning training data is proposed and used to select an appropriate value of K, the number of random attributes of the random forest. In the supervised random forest algorithm, the out-of-bag error can be computed from the training data and is an unbiased estimator of the generalization error; it can therefore be used to select the parameter K so as to obtain a classifier with a small generalization error. This research proposes a method, based on the model-selection criterion of POSC4.5, to compute the out-of-bag error from PU Learning training data and thereby select an appropriate value of K.

Experiments on UCI data show that the proposed algorithm achieves higher accuracy than POSC4.5, bagged POSC4.5, and the biased Support Vector Machine, and is faster than the biased Support Vector Machine.
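The per-node random feature selection described in achievement (1) can be illustrated with a minimal sketch. This is not the thesis's POSC4.5 implementation: `choose_split` and `entropy` are hypothetical names, attributes are assumed binary, and plain information gain over the positive/unlabeled counts stands in for the PU-corrected criterion that POSC4.5 actually uses.

```python
import math
import random

def entropy(pos, total):
    # Binary entropy of the fraction of labeled positives at a node.
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def choose_split(rows, labels, n_attrs, k, rng=None):
    # rows: binary attribute vectors; labels: 1 = labeled positive, 0 = unlabeled.
    # At each tree node, only a random subset of k attributes is considered
    # (the random-forest twist added to the base decision tree), and the
    # best of those is chosen by information gain.
    rng = rng or random.Random(0)
    candidates = rng.sample(range(n_attrs), k)   # random feature selection
    total, pos = len(rows), sum(labels)
    base = entropy(pos, total)
    best_attr, best_gain = None, -1.0
    for a in candidates:
        gain = base
        for v in (0, 1):
            idx = [i for i in range(total) if rows[i][a] == v]
            if idx:
                p = sum(labels[i] for i in idx)
                gain -= len(idx) / total * entropy(p, len(idx))
        if gain > best_gain:
            best_attr, best_gain = a, gain
    return best_attr, best_gain
```

With k equal to the full attribute count this reduces to ordinary greedy splitting; smaller k decorrelates the trees in the forest.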
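The two bootstrap sampling methods of achievement (2) can be sketched as follows. The function names and the list-of-examples representation are assumptions for illustration; the thesis's actual algorithms operate within the forest-growing procedure.

```python
import random

def bootstrap_pooled(P, U, rng):
    # Scenario 1: merge the positive sample into the unlabeled data first,
    # then draw one bootstrap replicate from the combined set.
    data = [(x, 1) for x in P] + [(x, 0) for x in U]
    return [rng.choice(data) for _ in range(len(data))]

def bootstrap_stratified(P, U, rng):
    # Scenario 2: resample positives and unlabeled data separately, so each
    # replicate preserves the positive/unlabeled proportions of the input.
    samp_p = [(rng.choice(P), 1) for _ in range(len(P))]
    samp_u = [(rng.choice(U), 0) for _ in range(len(U))]
    return samp_p + samp_u
```

The pooled scheme lets the number of positives vary across trees, while the stratified scheme fixes it, which matches data generated under the two different PU scenarios.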
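The use of out-of-bag error to tune K in achievement (3) amounts to a small model-selection loop. In this sketch, `build_forest` and `oob_error` are placeholder callables standing in for the thesis's forest construction and its POSC4.5-criterion-based out-of-bag estimate on PU data.

```python
def select_k_by_oob(train, k_values, build_forest, oob_error):
    # For each candidate number of random attributes K, grow a forest and
    # score it on the out-of-bag samples (those left out of each tree's
    # bootstrap replicate); keep the K with the smallest estimated error.
    best_k, best_err = None, float("inf")
    for k in k_values:
        err = oob_error(build_forest(train, k), train)
        if err < best_err:
            best_k, best_err = k, err
    return best_k
```

Because the out-of-bag estimate reuses the training data, no separate validation split of the scarce positive sample is needed.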
Keywords/Search Tags:positive and unlabeled learning, decision tree algorithm, random forest, ensemble learning