
Research On Semi-supervised Classification Algorithm Based On Clustering Ensemble

Posted on: 2019-06-03  Degree: Master  Type: Thesis
Country: China  Candidate: H J Wang  Full Text: PDF
GTID: 2438330572455968  Subject: Software engineering
Abstract/Summary:
In many real-world big-data settings, sample features can be acquired conveniently and cheaply through sensors, whereas labels must be assigned by experts, which is difficult and expensive. As a result, unlabeled data far outnumber labeled data. A classification model trained on only a few labeled samples by a traditional supervised learning algorithm often underfits, yielding low classification accuracy. Semi-supervised classification makes full use of unlabeled samples to improve classifier accuracy and is widely applied in fields such as intelligent information processing, image processing, and the life sciences. Its main research directions include disagreement-based methods, generative methods, discriminative methods, and graph-based methods. These methods offer many strengths, such as strong performance, well-developed mathematical theory, fast computation, and high classification accuracy. However, they do not fully account for the uncertainty and complexity of classifying unlabeled data, which leads to poor stability and robustness. Ensemble learning can reduce the uncertainty of unlabeled samples during the labeling stage of semi-supervised classification, refine the semi-supervised decision boundary, and improve the algorithm's noise resistance and reliability. Yet semi-supervised learning targets settings where labeled samples are scarce, while traditional ensemble learning itself requires a large number of labeled samples for training, so the two learning paradigms are in tension. On this basis, this thesis proposes a semi-supervised classification method based on clustering ensemble. On the one hand, it improves the stability of semi-supervised classification; on the other hand, it resolves the conflicting labeled-sample requirements of semi-supervised classification
and ensemble learning. The method comprises the following two algorithms:

1. The k-Means clustering algorithm based on initial center optimization and feature weighting (COFW). k-Means is a typical unsupervised clustering algorithm with two shortcomings: randomly chosen initial cluster centers often produce unstable clustering results, and because all features are treated uniformly and equally, the more informative features are not emphasized. COFW uses a new initial-center selection method to obtain k initial cluster centers and performs an initial clustering with uniform feature weights; it then derives feature weights from each feature's contribution to the clustering, adjusts the weights according to the clustering accuracy, and clusters again, repeating this procedure until the clustering accuracy no longer changes, so as to obtain the final clustering result.

2. The semi-supervised binary classification algorithm based on clustering ensemble (SUCE). Existing clustering algorithms cannot act on classification directly, so the information in the labeled samples must be fully exploited to assist classification. During this process, however, the labels assigned to unlabeled samples are uncertain, which destabilizes classification performance. SUCE first generates, under different parameter settings, a large number of base clustering results with clustering algorithms such as COFW, k-Means, EM, FarthestFirst, and HierarchicalClusterer, evaluates and selects among these base results, and obtains cluster labels for the samples; it then uses the labeled samples to derive predicted labels, and finally obtains consistent labels by ensembling the predictions. In other words, it pre-classifies the test set by integrating the base clustering results, moves high-confidence samples into the training set, and then classifies the remaining test samples by utilizing the expanded
training sets with basic supervised learning algorithms such as C4.5, Naive Bayes, kNN, Logistic regression, and OneR.

The experiments use real data sets from the UCI repository. After extensive parameter tuning and comparison with existing algorithms, the results show that: 1) COFW achieves better clustering performance than k-Means; 2) SUCE improves the classification accuracy of the base classifiers; and 3) when training samples are extremely limited, SUCE's classification accuracy improves significantly.
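The COFW loop described above can be illustrated with a minimal sketch. The abstract does not specify the initial-center selection method or the exact weight-update rule, so this sketch assumes random initialization and weights inversely proportional to each feature's within-cluster dispersion (a feature that varies little inside clusters is treated as contributing more); both choices are assumptions, not the thesis's actual formulas.

```python
import numpy as np

def weighted_kmeans(X, k, w, n_iter=100, seed=0):
    """One round of Lloyd's k-means using feature-weighted squared distance."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Weighted squared distance of every sample to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2 * w).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def cofw(X, k, n_rounds=10, tol=1e-4, seed=0):
    """Sketch of COFW: start from uniform feature weights, re-derive the
    weights from the clustering, and re-cluster until the weights converge.
    The inverse-dispersion update below is an assumed stand-in for the
    thesis's contribution-based weighting."""
    n, m = X.shape
    w = np.full(m, 1.0 / m)              # initial, equal feature weights
    for _ in range(n_rounds):
        labels, centers = weighted_kmeans(X, k, w, seed=seed)
        # Within-cluster dispersion of each feature.
        disp = np.zeros(m)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                disp += ((pts - centers[j]) ** 2).sum(axis=0)
        new_w = 1.0 / (disp + 1e-12)     # low dispersion -> high weight
        new_w /= new_w.sum()
        if np.abs(new_w - w).max() < tol:
            break
        w = new_w
    return labels, w
```

On data where one feature separates the clusters and another is pure noise, the loop shifts weight toward the discriminative feature within a couple of rounds.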
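The SUCE pre-classification step can likewise be sketched. Here plain k-means over several parameter settings stands in for the mix of base clusterers (COFW, EM, FarthestFirst, etc.), each cluster is labeled by majority vote of the labeled samples it contains, and the per-clustering predictions are averaged; the 0.8 confidence threshold and the handling of clusters without labeled members are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, seed, n_iter=50):
    """Plain Lloyd's k-means, a stand-in for the base clusterers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0)
                        if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def suce(X_lab, y_lab, X_unl, ks=(2, 3, 4), seeds=(0, 1, 2), threshold=0.8):
    """Sketch of SUCE pre-classification for binary labels: each base
    clustering votes a class for every unlabeled sample (majority label of
    the labeled samples in its cluster, 0.5 if the cluster has none); votes
    are averaged across clusterings, and only samples whose averaged vote
    clears `threshold` receive a confident pseudo-label."""
    X = np.vstack([X_lab, X_unl])
    n_lab = len(X_lab)
    votes = []
    for k in ks:
        for s in seeds:
            cl = kmeans(X, k, s)
            pred = np.full(len(X), 0.5)        # 0.5 = no labeled evidence
            for c in np.unique(cl):
                members = cl == c
                lab_members = y_lab[members[:n_lab]]
                if len(lab_members):           # majority vote in cluster
                    pred[members] = 1 if lab_members.mean() > 0.5 else 0
            votes.append(pred[n_lab:])
    votes = np.mean(votes, axis=0)             # fraction voting class 1
    conf = np.maximum(votes, 1.0 - votes) >= threshold
    pseudo = (votes >= 0.5).astype(int)
    return pseudo, conf                        # pseudo-labels + confidence
```

In the full method, the samples where `conf` is true would be appended to the training set before fitting a supervised base learner such as C4.5 or kNN.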
Keywords/Search Tags:Semi-supervised classification, Ensemble learning, Clustering, k-Means, Feature weighted, Initial clustering center optimization