
The Study Of Complex Data Processing Method Based On Classification

Posted on: 2014-02-09    Degree: Doctor    Type: Dissertation
Country: China    Candidate: H Y Wang    Full Text: PDF
GTID: 1268330425979615    Subject: Basic mathematics
Abstract/Summary:
Pattern classification is one of the core technologies of machine learning: samples with consistent properties are mapped to a given class by a learned classifier. In recent years classification research has produced many important results, such as decision trees, Bayes classifiers, neural networks, genetic algorithms and support vector machines. However, as application fields develop, classification must handle increasingly complicated and diverse data sets, which poses growing challenges for classifier design. To improve classification, this dissertation studies how to deal with high-dimensional, small sample size (SSS) data sets and with complicated data carrying heterogeneous features, and how to make full use of the structural information latent in the data distribution. The main contributions are as follows:

(1) To reduce the dimensionality of high-dimensional data, a novel k-NN classification algorithm based on orthogonal local discriminant embedding (O-LDE) is proposed. Firstly, two neighborhood graphs are constructed that best preserve the local within-class and between-class neighborhood information of the data; secondly, to overcome the SSS problem, the affinity matrices are rewritten and the optimization objective is modified; thirdly, the embedding from the high-dimensional space to the low-dimensional space is obtained by producing orthogonal basis vectors that solve the optimization; finally, k-NN classification is performed in the low-dimensional embedding subspace. After dimensionality reduction with O-LDE, data points of the same class keep their intrinsic neighbor relations, whereas neighboring points of different classes are pushed far apart. Experimental results on the public tumor data set Leukemia show that the proposed algorithm outperforms LDA, LLDE and LDE.

(2) To address the difficulty of making classification decisions with a single classifier on complicated heterogeneous data, a novel Bayesian ensemble algorithm based on grouped feature subset selection (EGFS+BC) is proposed. Firstly, the features are grouped by their source, and part of each group's features is randomly extracted as the initial feature subset; secondly, dynamic feature subset selection is carried out according to a strategy that improves the accuracy and diversity of the base classifiers; finally, the favorable Bayes base classifiers trained on the selected feature subsets are integrated under a weighted-voting ensemble learning framework. EGFS+BC exploits the difference and complementarity among the diversified features. Experimental results on the public heterogeneous data set DDSM demonstrate that the proposed classification scheme outperforms many single traditional classifiers, such as k-NN, Boost C5 and Neural Net.
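The first contribution follows an embed-then-classify pipeline: project the high-dimensional data into a discriminative low-dimensional subspace, then run k-NN there. As a minimal sketch, the Python code below uses scikit-learn's LinearDiscriminantAnalysis as a stand-in for the proposed O-LDE projection (which is not a library routine); the toy data set, split and parameters are illustrative assumptions only.

    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    # Toy high-dimensional, small-sample-size data (gene-expression-like shape).
    X, y = make_classification(n_samples=72, n_features=2000, n_informative=50,
                               n_classes=2, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    pipe = make_pipeline(
        LinearDiscriminantAnalysis(n_components=1),  # supervised embedding (stand-in for O-LDE)
        KNeighborsClassifier(n_neighbors=3),         # k-NN in the low-dimensional subspace
    )
    pipe.fit(X_tr, y_tr)
    print("test accuracy:", pipe.score(X_te, y_te))

The second contribution combines Bayes base classifiers trained on grouped feature subsets through weighted voting. The sketch below, again under stated assumptions, trains one naive Bayes classifier per feature group on a randomly drawn subset and weights its vote by training accuracy; the grouping, subset size and weighting are illustrative simplifications, not the exact EGFS+BC selection strategy.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)

    def train_grouped_ensemble(X, y, feature_groups, subset_size=5):
        # One Bayes base classifier per feature group, each trained on a random
        # subset of that group's columns; training accuracy serves as a crude
        # voting weight.
        members = []
        for cols in feature_groups:
            cols = rng.choice(np.asarray(cols), size=min(subset_size, len(cols)),
                              replace=False)
            clf = GaussianNB().fit(X[:, cols], y)
            members.append((cols, clf, clf.score(X[:, cols], y)))
        return members

    def predict_weighted_vote(members, X):
        # Weighted vote over the base classifiers' class posteriors.
        votes = sum(w * clf.predict_proba(X[:, cols]) for cols, clf, w in members)
        return members[0][1].classes_[votes.argmax(axis=1)]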
(3) To make better use of structural information for classification, a family of grouped SVM algorithms based on sample space partition is proposed, including the clustered group SVM (GC-SVM) and the grouped fuzzy SVM with EM-based sample space partition (EMG-FSVM). Firstly, the sample spaces of the positive and the negative class are each partitioned into several subsets according to a chosen similarity criterion (such as k-means or EM clustering), so that the within-class structural information is clearly described; secondly, a separate sub-SVM classifier is trained on each combination of subsets carrying different class labels; finally, an unknown sample is predicted by the specific sub-SVM selected according to the Mahalanobis distance between the new sample and the center of each subset. This integrated classification framework casts a difficult two-class classification problem into a series of simple two-class sub-problems, which shortens training time and speeds up classification. Experiments on both synthetic data and real clustered-microcalcification detection data show that the proposed framework achieves clearly better performance and stability than single SVMs with different kernels.
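As a minimal sketch of this grouped SVM idea (illustrative assumptions, not the dissertation's exact formulation), the code below clusters each class with k-means, trains one sub-SVM per pair of within-class subsets, and routes a test sample to the sub-SVM of its nearest positive and nearest negative cluster centre, with Euclidean distance standing in for the Mahalanobis routing.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    def fit_grouped_svm(X, y, k_pos=2, k_neg=2):
        # Cluster each class separately, then train one sub-SVM per pair of
        # (positive subset, negative subset).
        pos, neg = X[y == 1], X[y == 0]
        km_p = KMeans(n_clusters=k_pos, n_init=10, random_state=0).fit(pos)
        km_n = KMeans(n_clusters=k_neg, n_init=10, random_state=0).fit(neg)
        subs = {}
        for i in range(k_pos):
            for j in range(k_neg):
                Xi = np.vstack([pos[km_p.labels_ == i], neg[km_n.labels_ == j]])
                yi = np.r_[np.ones((km_p.labels_ == i).sum()),
                           np.zeros((km_n.labels_ == j).sum())]
                subs[(i, j)] = SVC(kernel="rbf").fit(Xi, yi)
        return km_p, km_n, subs

    def predict_grouped_svm(model, X):
        # Route each test point to the sub-SVM of its nearest positive and
        # negative cluster centres (Euclidean stand-in for Mahalanobis routing).
        km_p, km_n, subs = model
        i, j = km_p.predict(X), km_n.predict(X)
        return np.array([subs[(i[t], j[t])].predict(X[t:t + 1])[0]
                         for t in range(len(X))])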
Keywords/Search Tags: pattern classification, small sample size, dimensionality reduction, heterogeneous data, selective ensemble learning, structural information of data, sample space partition