Font Size: a A A

Large Scale Classification Algorithms Based On Clustering Feature Trees

Posted on:2013-01-16Degree:DoctorType:Dissertation
Country:ChinaCandidate:G C WuFull Text:PDF
GTID:1118330374476447Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
As the core content of pattern recognition, machine learning and data mining,classification has been widely applied to practical problems such as text classification, webpage classification, speech recognition, image recognition and biological informationprocessing, etc. With the rapid development of information technology, massive data obtainedfrom Internet and digital devices pose severe challenges for traditional algorithms, in terms ofcomputation and memory requirements. Therefore, how to deal with large scale classificationproblems has become a hot topic in the research community.Based on the study of the current state-of-art of large scale classification, this dissertationinvestigates the large scale supervised and semi-supervised classification algorithms byorganizing samples with clustering feature (CF) trees and using local or global learningstrategies. The contributions are as follows:Firstly, we propose the concept of clustering feature with label (CFL) and design a CFLtree and local learning based support vector machine (CFLL-SVM) algorithm for large scaleclassification. The CF tree, originally an unsupervised clustering algorithm, can compress anddivide samples efficiently. It is improved to the CFL tree for supervised classification bycombining unsupervised clustering and supervised clustering methods. Through the CFL tree,CFLL-SVM divides the entire training sample set into small subsets, and trains asub-classifier for each subset based on a local learning strategy, and forms an overall classifierfrom all the sub-classifiers. Moreover, we analyze the effects of parameters associated withCFLL-SVM. Experimental results show that CFLL-SVM can accelerate the training processwithout loss of accuracy.Secondly, we propose a CF tree and progressive labeling based semi-supervised supportvector machine (CFPL-S3VM) algorithm for large scale semi-supervised classification. Inmany practical problems, unlabeled samples are available in large quantity and easy to collectwhile labeled samples are difficult. In order to utilize of the tremendous amount of unlabeledsamples to discover the intrinsic structure of the given samples, CFPL-S3VM uses the CF treeto organize the unlabeled samples. Then based on a coarse-to-fine global learning strategy, itbuilds a semi-supervised support vector machine classifier respectively at each layer from theroot to the leaf of the CF tree by using the cluster centers, which are the representatives of theunlabeled samples, and the labeled samples. At each layer, it gets the potential supportclusters, which affect the decision hyperplane, and prunes the useless clusters to reduce thelearning scale of the next layer. Moreover, the summaries of pruned clusters are retained via labeling the cluster centers to avoid the loss of pruned information leading to low accuracy.Experimental results show that on the sample sets with a limited number of support vectors,CFPL-S3VM can significantly reduce the training time while maintaining the same accuracylevel.Thirdly, we propose an algorithm for large scale semi-supervised classification based onCF tree and local graph transduction (CF-LGT). While the classification boundary of asample set is very complex and there are many support vectors, CFPL-S3VM is not capable ofreducing the learning scale via pruning. Therefore, we introduce a local learning strategy andapply the CF tree to decompose the unlabeled samples to a series of subsets. For each subset,CF-LGT constructs a novel sparse and anti-noise nearest neighbor graph and predicts thelabels of the samples by a graph-based method. As a result, it saves the memory andaccelerates the learning process. Experimental results show that on the sample sets with acertain scale of labeled samples, CF-LGT demonstrates the advantages of low memory cost,high classification accuracy and short learning time.Fourthly, we propose a local learning algorithm integrating global structure (LLGS) forlarge scale semi-supervised classification. While the labeled samples are rare or notdistributed uniformly, the local learning strategy of CF-LGT will destroy the global structureof the samples which leads to a low classification accuracy. Therefore, LLGS first applies theCF tree to decompose the unlabeled samples to a series of subsets, and then retrieves theglobal structural information and integrates such information into each local subset. At last itpredicts the labels of the samples locally using a graph-based method. Experimental resultsshow that even on the sample sets with few labeled samples, LLGS shows high classificationaccuracy and low memory cost. In addition, LLGS is inductive and can predict the labels ofout of the samples.
Keywords/Search Tags:Large scale classification, Clustering feature tree, Support vector machine, Semi-supervised support vector machine, Graph-based method
PDF Full Text Request
Related items