
Research On A Semi-supervised Random Forest Classification Algorithm And Its Parallelization

Posted on: 2018-12-16  Degree: Master  Type: Thesis
Country: China  Candidate: C Ma  Full Text: PDF
GTID: 2348330536969494  Subject: Engineering
Abstract/Summary:
Machine learning is one of the core topics in artificial intelligence research and comprises three important fields: supervised learning, unsupervised learning, and semi-supervised learning. Supervised learning builds a model from labeled samples only, but it requires many of them. Unsupervised learning uses only unlabeled samples, but it cannot guarantee the accuracy of the resulting model. In practice, labeled samples must be annotated by human workers and are therefore expensive and scarce, while unlabeled samples are cheap and plentiful. Semi-supervised learning, which constructs a classification model from a few labeled samples and many unlabeled samples, has thus become a hot research topic in machine learning.

Within semi-supervised learning, collaborative training is a paradigm with many established results, and the algorithm in this paper is also based on it. However, most existing collaborative training algorithms have two problems. First, because the number of classifiers is limited, they cannot achieve enough diversity. Second, when there are only a few labeled samples or the initial classifiers perform poorly, the newly labeled samples contain too much noise; as a result, subsequent training cannot continue, or it ends with a bad model.

The research results of this paper include:

(1) For the first problem, this paper introduces the idea of decision-tree grouping into a semi-supervised random forest classification algorithm, which ensures both a sufficient number of decision trees and their diversity. To prevent a lack of difference between the auxiliary decision trees, the trees are divided into groups; each group in turn serves as the main classifier while the remaining groups serve as auxiliary classifiers. In this way the auxiliary classifiers differ enough, so the newly labeled samples differ enough, and the decision trees stay diverse.

(2) For the second problem, to prevent excessive noise in the newly labeled samples, this paper introduces a data-editing stage: the newly labeled samples are pruned by the KNN algorithm, which reduces their noise to a certain extent.

(3) To cope with the challenge of big data, the semi-supervised random forest is extended to Spark, a distributed platform. The serial steps are modified to suit distributed in-memory computation, which allows the algorithm to make full use of large data sets.

(4) The improved algorithm is compared with existing algorithms through a large number of experiments. In a stand-alone environment, nine standard data sets are used, and the results show an obvious improvement. The experiment in the Spark environment is based on an open-source big data set and achieves the expected result: cheap unlabeled samples continuously strengthen the learning model.
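The grouped co-training idea in result (1) can be illustrated with a minimal sketch. This is not the thesis implementation; the dataset, number of groups, trees per group, and confidence threshold are all illustrative assumptions. Each group of trees takes a turn as the main classifier while the other groups vote as auxiliary classifiers to pseudo-label unlabeled samples for it:

```python
# Sketch of grouped co-training for a semi-supervised random forest.
# Trees are split into groups; the auxiliary groups pseudo-label
# unlabeled samples for the current main group. All numbers are
# illustrative, not the thesis settings.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
labeled = rng.rand(len(y)) < 0.2          # pretend only ~20% of labels are known
X_l, y_l = X[labeled], y[labeled]
X_u = X[~labeled]

n_groups, trees_per_group = 3, 5
groups = []
for g in range(n_groups):
    trees = []
    for t in range(trees_per_group):
        idx = rng.randint(0, len(X_l), len(X_l))   # bootstrap sampling for diversity
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=g * 10 + t)
        trees.append(tree.fit(X_l[idx], y_l[idx]))
    groups.append(trees)

def group_vote(trees, X):
    """Average the per-tree class probabilities over a set of trees."""
    return np.mean([t.predict_proba(X) for t in trees], axis=0)

# One co-training round: each group takes a turn as the main classifier.
for main in range(n_groups):
    aux = [t for g, trees in enumerate(groups) if g != main for t in trees]
    proba = group_vote(aux, X_u)
    conf = proba.max(axis=1)
    pick = conf > 0.9                      # keep only confident pseudo-labels
    if pick.any():
        X_new = np.vstack([X_l, X_u[pick]])
        y_new = np.concatenate([y_l, proba[pick].argmax(axis=1)])
        for tree in groups[main]:
            tree.fit(X_new, y_new)         # retrain the main group

forest = [t for trees in groups for t in trees]
acc = (group_vote(forest, X).argmax(axis=1) == y).mean()
print(f"training accuracy after one round: {acc:.2f}")
```

Because each main group is retrained on pseudo-labels produced by trees it does not contain, the labeling views stay different across turns, which is the diversity the grouping is meant to preserve.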
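The data-editing stage in result (2) can be sketched as follows. This is an assumption about how the KNN pruning might look, not the thesis code: a pseudo-labeled sample is kept only when its label agrees with the majority vote of its k nearest labeled neighbors, trimming noisy pseudo-labels before retraining.

```python
# Sketch of KNN-based data editing for pseudo-labeled samples.
# A pseudo-label survives only if the k nearest labeled neighbors agree.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def edit_pseudo_labels(X_l, y_l, X_p, y_p, k=3):
    """Return the pseudo-labeled pairs whose labels pass a k-NN check."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_l, y_l)
    keep = knn.predict(X_p) == y_p
    return X_p[keep], y_p[keep]

# Tiny example: two well-separated clusters; one pseudo-label is wrong.
X_l = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y_l = np.array([0, 0, 0, 1, 1, 1])
X_p = np.array([[0.15], [5.05]])
y_p = np.array([0, 0])                     # second pseudo-label is noise
X_kept, y_kept = edit_pseudo_labels(X_l, y_l, X_p, y_p)
print(X_kept, y_kept)                      # only the correct pair survives
```

In the example the sample at 5.05 carries label 0 but sits among class-1 neighbors, so the editing step discards it; only the consistent pseudo-label at 0.15 is kept.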
Keywords/Search Tags: Semi-Supervised Learning, Data Editing, Collaborative Training, Spark