
Research On A Semi-supervised Random Forest Classification Algorithm And Its Parallelization

Posted on: 2018-12-16  Degree: Master  Type: Thesis
Country: China  Candidate: C Ma  Full Text: PDF
GTID: 2348330536969494  Subject: Engineering
Abstract/Summary:
Machine learning is one of the core topics in artificial intelligence research and comprises three important fields: supervised learning, unsupervised learning, and semi-supervised learning. Supervised learning builds a model from labeled samples only, but it requires many of them. Unsupervised learning uses only unlabeled samples, but it cannot guarantee the accuracy of the resulting model. In practice, labeled samples must be annotated by human workers and are therefore expensive and scarce, while unlabeled samples are cheap and plentiful. Semi-supervised learning, which constructs a classification model from a few labeled samples and many unlabeled samples, has thus become a hot research topic in machine learning.

Within semi-supervised learning, collaborative training is a paradigm with many established results, and the algorithm in this paper is also based on it. However, most existing collaborative training algorithms have two problems. First, because the number of classifiers is limited, they cannot achieve enough diversity. Second, when there are only a few labeled samples or the initial classifiers perform poorly, the newly labeled samples contain too much noise; as a result, subsequent training cannot continue, or it ends with a bad model.

The research results of this paper include:

(1) For the first problem, this paper introduces the idea of decision-tree grouping into a semi-supervised random forest classification algorithm, which ensures both a sufficient number of decision trees and their diversity. To prevent a lack of difference between the auxiliary decision trees, the trees are divided into groups; each group in turn serves as the main classifier while the remaining groups serve as auxiliary classifiers. In this way the auxiliary classifiers differ enough, so the newly labeled samples differ enough, and the decision trees stay diverse.

(2) For the second problem, to prevent excessive noise in the newly labeled samples, this paper introduces a data-editing stage: the newly labeled samples are pruned by the KNN algorithm, which reduces their noise to a certain extent.

(3) To cope with the challenge of big data, the semi-supervised random forest is extended to Spark, a distributed platform. The serial steps are modified to suit distributed in-memory computation, which allows the algorithm to make full use of large data sets.

(4) The improved algorithm is compared with existing algorithms through a large number of experiments. In a stand-alone environment, nine standard data sets are used, and the results show an obvious improvement. The experiment in the Spark environment is based on an open-source big data set and achieves the expected result: cheap unlabeled samples continuously strengthen the learning model.
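The grouped co-training idea in result (1) can be illustrated with a minimal sketch. This is not the thesis implementation; the dataset, number of groups, trees per group, and confidence threshold are all illustrative assumptions. Each group of trees takes a turn as the main classifier while the other groups vote as auxiliary classifiers to pseudo-label unlabeled samples for it:

```python
# Sketch of grouped co-training for a semi-supervised random forest.
# Trees are split into groups; the auxiliary groups pseudo-label
# unlabeled samples for the current main group. All numbers are
# illustrative, not the thesis settings.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
labeled = rng.rand(len(y)) < 0.2          # pretend only ~20% of labels are known
X_l, y_l = X[labeled], y[labeled]
X_u = X[~labeled]

n_groups, trees_per_group = 3, 5
groups = []
for g in range(n_groups):
    trees = []
    for t in range(trees_per_group):
        idx = rng.randint(0, len(X_l), len(X_l))   # bootstrap sampling for diversity
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=g * 10 + t)
        trees.append(tree.fit(X_l[idx], y_l[idx]))
    groups.append(trees)

def group_vote(trees, X):
    """Average the per-tree class probabilities over a set of trees."""
    return np.mean([t.predict_proba(X) for t in trees], axis=0)

# One co-training round: each group takes a turn as the main classifier.
for main in range(n_groups):
    aux = [t for g, trees in enumerate(groups) if g != main for t in trees]
    proba = group_vote(aux, X_u)
    conf = proba.max(axis=1)
    pick = conf > 0.9                      # keep only confident pseudo-labels
    if pick.any():
        X_new = np.vstack([X_l, X_u[pick]])
        y_new = np.concatenate([y_l, proba[pick].argmax(axis=1)])
        for tree in groups[main]:
            tree.fit(X_new, y_new)         # retrain the main group

forest = [t for trees in groups for t in trees]
acc = (group_vote(forest, X).argmax(axis=1) == y).mean()
print(f"training accuracy after one round: {acc:.2f}")
```

Because each main group is retrained on pseudo-labels produced by trees it does not contain, the labeling views stay different across turns, which is the diversity the grouping is meant to preserve.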
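The data-editing stage in result (2) can be sketched as follows. This is an assumption about how the KNN pruning might look, not the thesis code: a pseudo-labeled sample is kept only when its label agrees with the majority vote of its k nearest labeled neighbors, trimming noisy pseudo-labels before retraining.

```python
# Sketch of KNN-based data editing for pseudo-labeled samples.
# A pseudo-label survives only if the k nearest labeled neighbors agree.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def edit_pseudo_labels(X_l, y_l, X_p, y_p, k=3):
    """Return the pseudo-labeled pairs whose labels pass a k-NN check."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_l, y_l)
    keep = knn.predict(X_p) == y_p
    return X_p[keep], y_p[keep]

# Tiny example: two well-separated clusters; one pseudo-label is wrong.
X_l = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y_l = np.array([0, 0, 0, 1, 1, 1])
X_p = np.array([[0.15], [5.05]])
y_p = np.array([0, 0])                     # second pseudo-label is noise
X_kept, y_kept = edit_pseudo_labels(X_l, y_l, X_p, y_p)
print(X_kept, y_kept)                      # only the correct pair survives
```

In the example the sample at 5.05 carries label 0 but sits among class-1 neighbors, so the editing step discards it; only the consistent pseudo-label at 0.15 is kept.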
Keywords/Search Tags: Semi-Supervised Learning, Data Editing, Collaborative Training, Spark