
Research On Positive And Unlabeled Learning By Random Forest

Posted on: 2015-04-09
Degree: Master
Type: Thesis
Country: China
Candidate: Q Shao
Full Text: PDF
GTID: 2298330434460218
Subject: Computer software and theory

Abstract/Summary:
Training classifiers from positive and unlabeled data is called Positive and Unlabeled Learning (PU Learning). Traditional classification algorithms require fully labeled training samples to train classifiers. In reality, however, data are usually completely unlabeled or labeled only to a small extent. For binary classification problems, positive data, in addition to unlabeled data, can also be acquired easily in many situations. If classifiers can be trained with small amounts of positive data and large amounts of unlabeled data, and their performance is comparable to that of classifiers trained on fully labeled samples, a great deal of time and human labor can be saved.

To address the problem of PU Learning, this thesis studies how to ensemble POSC4.5 decision tree classifiers into a random forest classifier for PU Learning that achieves high classification performance and requires little training time. The main research achievements are as follows:

(1) The POSC4.5 algorithm is extended to support random feature selection. POSC4.5 is a PU Learning decision tree algorithm with the advantages of good classification performance and a foundation in computational learning theory. This thesis uses POSC4.5 as the base algorithm of the random forest and extends it so that, while growing a decision tree, a subset of attributes is randomly selected at each tree node, and the best splitting attribute is then chosen from that subset by the information-gain criterion computed under PU Learning.

(2) Two random forest algorithms for PU Learning are proposed, targeting two different PU Learning scenarios. In the two scenarios, PU training data are generated in different ways. This research analyses the computational-learning-theory foundation of the POSC4.5 algorithm and adopts a different bootstrap sampling method for each scenario.
In the first bootstrap sampling method, the positive sample is first merged into the unlabeled data, and the combined data set is then sampled. In the second, positive and unlabeled data are sampled separately. Based on these two bootstrap sampling methods, two random forest algorithms for PU Learning are developed.

(3) An out-of-bag error computed from PU Learning training data is proposed and used to select an appropriate value of K, the number of random attributes of the random forest. In the supervised random forest algorithm, the out-of-bag error can be computed from the training data and is an unbiased estimator of the generalization error; it can therefore be used to select the parameter K so as to obtain a classifier with a small generalization error. This research proposes a method, based on the model-selection criterion of POSC4.5, to compute the out-of-bag error from PU Learning training data and thereby select an appropriate value of K.

Experiments on UCI data show that the proposed algorithm achieves higher accuracy than POSC4.5, bagged POSC4.5, and the biased Support Vector Machine, and is faster than the biased Support Vector Machine.
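The per-node random feature selection described in achievement (1) can be illustrated with a minimal sketch. This is not the thesis's POSC4.5 implementation: `choose_split` and `entropy` are hypothetical names, attributes are assumed binary, and plain information gain over the positive/unlabeled counts stands in for the PU-corrected criterion that POSC4.5 actually uses.

```python
import math
import random

def entropy(pos, total):
    # Binary entropy of the fraction of labeled positives at a node.
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def choose_split(rows, labels, n_attrs, k, rng=None):
    # rows: binary attribute vectors; labels: 1 = labeled positive, 0 = unlabeled.
    # At each tree node, only a random subset of k attributes is considered
    # (the random-forest twist added to the base decision tree), and the
    # best of those is chosen by information gain.
    rng = rng or random.Random(0)
    candidates = rng.sample(range(n_attrs), k)   # random feature selection
    total, pos = len(rows), sum(labels)
    base = entropy(pos, total)
    best_attr, best_gain = None, -1.0
    for a in candidates:
        gain = base
        for v in (0, 1):
            idx = [i for i in range(total) if rows[i][a] == v]
            if idx:
                p = sum(labels[i] for i in idx)
                gain -= len(idx) / total * entropy(p, len(idx))
        if gain > best_gain:
            best_attr, best_gain = a, gain
    return best_attr, best_gain
```

With k equal to the full attribute count this reduces to ordinary greedy splitting; smaller k decorrelates the trees in the forest.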
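The two bootstrap sampling methods of achievement (2) can be sketched as follows. The function names and the list-of-examples representation are assumptions for illustration; the thesis's actual algorithms operate within the forest-growing procedure.

```python
import random

def bootstrap_pooled(P, U, rng):
    # Scenario 1: merge the positive sample into the unlabeled data first,
    # then draw one bootstrap replicate from the combined set.
    data = [(x, 1) for x in P] + [(x, 0) for x in U]
    return [rng.choice(data) for _ in range(len(data))]

def bootstrap_stratified(P, U, rng):
    # Scenario 2: resample positives and unlabeled data separately, so each
    # replicate preserves the positive/unlabeled proportions of the input.
    samp_p = [(rng.choice(P), 1) for _ in range(len(P))]
    samp_u = [(rng.choice(U), 0) for _ in range(len(U))]
    return samp_p + samp_u
```

The pooled scheme lets the number of positives vary across trees, while the stratified scheme fixes it, which matches data generated under the two different PU scenarios.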
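The use of out-of-bag error to tune K in achievement (3) amounts to a small model-selection loop. In this sketch, `build_forest` and `oob_error` are placeholder callables standing in for the thesis's forest construction and its POSC4.5-criterion-based out-of-bag estimate on PU data.

```python
def select_k_by_oob(train, k_values, build_forest, oob_error):
    # For each candidate number of random attributes K, grow a forest and
    # score it on the out-of-bag samples (those left out of each tree's
    # bootstrap replicate); keep the K with the smallest estimated error.
    best_k, best_err = None, float("inf")
    for k in k_values:
        err = oob_error(build_forest(train, k), train)
        if err < best_err:
            best_k, best_err = k, err
    return best_k
```

Because the out-of-bag estimate reuses the training data, no separate validation split of the scarce positive sample is needed.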
Keywords/Search Tags:positive and unlabeled learning, decision tree algorithm, random forest, ensemble learning