Feature Extraction And Classification Methods For Imbalanced And Small Sample Datasets

Posted on: 2013-02-10
Degree: Master
Type: Thesis
Country: China
Candidate: C G Tao
Full Text: PDF
GTID: 2268330392967986
Subject: Computer Science and Technology
Abstract/Summary:
With the expanding application areas of machine learning and data mining, imbalanced and small-sample data are encountered more and more often. Small-sample data are data whose sample count is small relative to the dimensionality of the dataset. Imbalanced data are data in which the class sizes differ greatly. High-dimensional, small-sample data pose many challenges to traditional machine learning algorithms: the time and space needed to construct a model are demanding. Imbalanced samples also cause great difficulty for traditional pattern analysis algorithms, which are built on the assumption of balanced datasets; their performance degrades rapidly when they are confronted with imbalanced data.

In this paper, we study methods for dealing with such datasets. First, we use classical feature extraction algorithms to extract features from small-sample data in order to reduce the dimension of the data. For certain parameter settings of the feature extraction algorithms, we introduce the particle swarm optimization algorithm to optimize the parameters automatically, replacing the existing experience-based parameter setting methods. Feature extraction algorithms can be divided into linear and nonlinear variants, and into supervised and unsupervised variants. We propose an algorithm that combines different feature extraction algorithms at the decision level, so as to exploit the advantages and offset the disadvantages of each feature extraction method.
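As a rough illustration of automatic parameter tuning with particle swarm optimization, the sketch below minimizes a one-dimensional objective over a bounded range. It is only a minimal sketch under stated assumptions: the function `pso_minimize`, its default coefficients, and the toy objective `toy_error` (a stand-in for "classification error as a function of one feature-extraction parameter") are hypothetical and are not taken from the thesis.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=60,
                 inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Minimal particle swarm optimizer over a box-constrained search space."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial positions inside the bounds, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # Each particle remembers its own best position; the swarm shares one global best.
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Move the particle and clamp it to the allowed range.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val


def toy_error(params):
    """Hypothetical objective: pretend the error is lowest at parameter value 0.3."""
    (gamma,) = params
    return (gamma - 0.3) ** 2


best_params, best_err = pso_minimize(toy_error, bounds=[(0.0, 1.0)])
print(best_params, best_err)
```

In a real setting the objective would presumably train and evaluate a classifier on the extracted features for each candidate parameter value, which is far more expensive than this toy function.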
To assess the feature extraction algorithms, we use the recognition rate of a support vector machine classifier built on the extracted features as the measure of extraction performance.

Second, for the imbalance problem, we approach our research from two aspects. First, we balance the imbalanced datasets at the data level, which involves oversampling the positive samples and undersampling the negative samples. For oversampling, we propose an improved SMOTE algorithm that injects synthetic positive samples into the imbalanced dataset to increase the number of positive samples. For undersampling, we introduce a spectral clustering algorithm into the undersampling procedure. We select a subset of the negative samples and a superset of the positive samples so that the numbers of positive and negative samples tend toward balance.

Finally, we handle imbalanced datasets at the algorithm level. We improve the classification methods for imbalanced datasets by introducing weighted support vector machines and the AdaBoost algorithm. The algorithm trains a number of base classifiers and combines them into a strong classifier. To measure the performance of classifiers built on imbalanced datasets, we use the area under the ROC curve rather than classification accuracy, because ROC curves take into account the classification results for both positive and negative samples. After processing at the data level and the algorithm level, traditional learning methods can be used to analyze imbalanced datasets and mine useful information from them. In addition, the time and space needed to construct a model on imbalanced and small-sample data processed by our algorithms are reduced.
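Two of the ideas above can be sketched in a few lines of plain Python. The first function shows the basic SMOTE interpolation step (not the improved variant proposed in the thesis, whose details are not given here); the second computes the area under the ROC curve as a pairwise ranking probability. The function names, the toy data, and the 0.5 decision threshold are assumptions made for illustration only.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (basic SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def roc_auc(labels, scores):
    """AUC as the probability that a random positive is scored above a
    random negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy positive (minority) class with four 2-D samples.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_like(minority, n_new=6)

# A trivial classifier that scores every sample 0.0 (always "negative")
# on a dataset with 8 negatives and 2 positives.
labels = [0] * 8 + [1] * 2
scores = [0.0] * 10
predictions = [1 if s >= 0.5 else 0 for s in scores]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
auc = roc_auc(labels, scores)
print(accuracy, auc)  # accuracy is 0.8, AUC is 0.5
```

The toy evaluation shows why accuracy can mislead on imbalanced data: a classifier that never predicts the positive class still reaches 80% accuracy here, while its AUC of 0.5 shows it ranks samples no better than chance.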
Finally, we verify our algorithms on artificial datasets and UCI datasets.

The algorithms proposed in this paper can alleviate the difficulties brought by imbalanced and small-sample datasets. In addition, the parameter optimization method proposed for feature extraction algorithms is significant for better mining useful information from the original data.
Keywords/Search Tags: imbalanced and small sample, feature extraction, feature fusion, resampling, support vector machine classification