Research On Training Set Construction Method In Pattern Classification

Posted on:2013-08-29

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X Yu

Full Text:PDF

GTID:1268330425966994

Subject:Computer application technology

Abstract/Summary:

Pattern classification is the process of describing, recognizing, classifying, and interpreting things or phenomena by processing and analyzing various forms of information about them; it is a basic form of intelligence in humans as well as animals. With the continuous growth of the human capacity to collect and store data, and the rapid development of computing power, the demand for using computers to analyze data and perform pattern classification has become increasingly widespread and urgent. Many effective classification algorithms have been developed in recent years, such as artificial neural networks, support vector machines, and decision trees. These algorithms have greatly promoted the application of pattern classification techniques in many fields.

However, pattern classification is far from a solved problem. Traditional classification algorithms require that training samples be sufficient and follow the same distribution as the testing samples. In practice, training sets often have shortcomings, such as small sample sizes, class imbalance, covariate shift, and large scale, so the precision and efficiency of classifiers suffer. It is therefore important to improve the performance of classifiers on defective data sets. Because the performance of a classifier depends heavily on the quality of its training set, constructing a new, high-quality training set is a promising approach.
For classification problems with low-quality training sets, we carry out the following work.

First, for the small sample problem, we propose a virtual sample generation method based on the Gaussian distribution. The method relies on the smoothness assumption of pattern classification and generates virtual samples near each original sample according to a Gaussian distribution, so that the original training set is expanded effectively. Because the smoothness assumption is the most common prior knowledge in pattern classification, the proposed method can be applied widely and preserves the authenticity of the generated samples as far as possible. We prove that for problems in which the smoothness assumption is not satisfied, our method is equivalent to a regularization method. Experiments on the iris and sonar data sets show that our method effectively improves classifier performance on small sample classification problems.

Second, for imbalanced data classification problems, we use the proposed virtual sample generation method to generate samples for the rare class, so that the number of samples in each class is balanced. We prove that even for problems in which the smoothness assumption is not satisfied, our method is equivalent to cost-sensitive learning. Experiments on the KDD Cup 99 intrusion detection data set and the sonar data set show that our method effectively improves classifier performance on imbalanced data classification problems.

Third, for classification under covariate shift, we propose a novel method that extracts from the original training set a subset that approximately follows the same distribution as the testing set.
The method extracts the subset by dividing the feature space into small subspaces and matching the sample counts of the target set and the auxiliary set in each subspace. As a result, the extracted subset follows approximately the same distribution as the testing set, and classifiers trained on this subset perform better. Experiments on a reconstructed UCI standard data set show that our method effectively improves classifier performance under covariate shift.

Finally, for the large scale classification problem, we propose a support vector pre-extraction algorithm based on improved vector projection for support vector machine classifiers. For linearly separable problems, we use the linear Fisher discriminant to compute the projection line. For non-linearly separable problems, we compute the projection line in one of two ways: either we map the original classification problem into a high-dimensional feature space with a kernel function and compute the mean vector as the projection line, or we use the kernel Fisher discriminant. We then select a certain ratio of the samples whose projections are adjacent to those of the other class as boundary vectors. Complexity analysis shows that the proposed algorithm has low computational and space complexity. Experiments on two artificial data sets and one real-world data set show that the proposed algorithm is almost as accurate as a traditional SVM or sequential minimal optimization (SMO), but much faster.
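
The Gaussian virtual sample generation described for the small sample and imbalanced cases can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function name `generate_virtual_samples`, the parameters `n_virtual` and `sigma`, and the choice to scale the noise by each feature's standard deviation are all assumptions made for illustration.

```python
import numpy as np

def generate_virtual_samples(X, y, n_virtual=5, sigma=0.1, seed=0):
    """Expand (X, y) with n_virtual Gaussian-perturbed copies of every sample.

    Hypothetical helper: each virtual sample is drawn from a Gaussian centred
    on an original sample and inherits that sample's label (smoothness
    assumption: nearby points share a class).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Assumption: noise is scaled per feature by sigma times the feature's spread.
    scale = sigma * X.std(axis=0)
    noise = rng.normal(0.0, 1.0, size=(n_virtual,) + X.shape) * scale
    # Shape (n_virtual, n_samples, n_features) flattened to rows of samples.
    X_virtual = (X[None, :, :] + noise).reshape(-1, X.shape[1])
    y_virtual = np.tile(y, n_virtual)
    return np.vstack([X, X_virtual]), np.concatenate([y, y_virtual])

# Tiny usage example: 3 original samples, 3 virtual copies each.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.2]])
y = np.array([0, 1, 1])
X_aug, y_aug = generate_virtual_samples(X, y, n_virtual=3)
print(X_aug.shape, y_aug.shape)
```

For the imbalanced setting, the same hypothetical helper could be called on the rare-class samples only, with `n_virtual` chosen so that the class counts become roughly equal.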
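
The linear case of the support vector pre-extraction idea can be sketched as follows. This is an illustrative reading of the abstract rather than the author's algorithm: the function names `fisher_direction` and `pre_extract`, the `ratio` and `reg` parameters, the ridge term added to the scatter matrix, and the rule of keeping the same fraction from each class are all assumptions.

```python
import numpy as np

def fisher_direction(X_pos, X_neg, reg=1e-6):
    """Linear Fisher discriminant direction w = Sw^{-1} (m_pos - m_neg)."""
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # Within-class scatter matrix summed over both classes.
    Sw = (X_pos - m_pos).T @ (X_pos - m_pos) + (X_neg - m_neg).T @ (X_neg - m_neg)
    # Assumption: a small ridge term keeps Sw invertible.
    Sw = Sw + reg * np.eye(Sw.shape[0])
    return np.linalg.solve(Sw, m_pos - m_neg)

def pre_extract(X, y, ratio=0.2, reg=1e-6):
    """Return indices of samples whose projections lie nearest the other class.

    Labels are assumed to be +1 / -1. From each class, the `ratio` fraction
    of samples closest to the opposite class along the Fisher direction is
    kept as candidate boundary vectors.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    w = fisher_direction(X[y == 1], X[y == -1], reg)
    p = X @ w  # positive class projects higher than negative class on average
    keep = []
    for label, sign in ((1, 1.0), (-1, -1.0)):
        idx = np.flatnonzero(y == label)
        k = max(1, int(round(ratio * idx.size)))
        # Positive class: smallest projections first; negative class: largest first.
        order = np.argsort(sign * p[idx])
        keep.append(idx[order[:k]])
    return np.sort(np.concatenate(keep))

# Usage on two synthetic Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 1.0, (50, 2)), rng.normal(-2.0, 1.0, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)
idx = pre_extract(X, y, ratio=0.2)
print(len(idx), "of", len(y), "samples kept as candidate boundary vectors")
```

The reduced set `X[idx], y[idx]` would then be passed to an SVM trainer in place of the full training set. The kernel variants mentioned in the abstract are not covered by this sketch.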

Keywords/Search Tags:

pattern classification, small sample data, imbalanced data, covariate shift, support vector pre-extraction

Related items

1	Feature Extraction And Classification Methods For Imbalanced And Small Sample Datasets
2	Research On Feature Analysis Technology For Small Sample Data
3	Research And Application Of Imbalance Data Classification Based On Support Vector Machine
4	Research On Classification Methods For Large-scale Imbalanced Data
5	Research On Classification Algorithm For Imbalanced Data Sets Based On Support Vector Machines
6	Research On Support Vector Machine Classification Method For Imbalanced Datasets
7	Research On Methods Of Imbalanced Data Set Classification
8	Research On Small Sample Data Feature Extraction And Classification Model Of Industrial Equipment
9	Support Vector Machine Based Classification Algorithms Research For Imbalanced Data
10	The Research Of Imbalanced Data Classification Algorithm Based On Support Vector Machine