
The Application Of SVM-RFE Algorithm To Data Analysis

Posted on: 2010-05-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y M Lu
Full Text: PDF
GTID: 2178360272495963
Subject: Computer application technology
Abstract/Summary:
SVM (Support Vector Machine) is a classification and regression method grounded in statistical learning theory, proposed by V. Vapnik at AT&T Bell Laboratories. It is a general-purpose machine learning method developed in recent years and a major research achievement in the field of machine learning. By replacing the Empirical Risk Minimization principle with the Structural Risk Minimization principle and by employing kernel functions, SVM copes well with small samples, non-linearity, the curse of dimensionality, and local minima, and it exhibits good generalization ability. Hence, SVM became one of the fastest-growing research fields at the end of the 1990s.

The essence of SVM training is solving a constrained convex quadratic programming (QP) problem. Small-scale QP problems can be optimized with mature algorithms such as Newton's method or interior-point methods. However, these algorithms usually require the Hessian matrix, which demands a disproportionate amount of memory and results in long training times. The classical QP methods are no longer viable for large training sets, especially when there are many support vectors. Redesigning training algorithms for large samples has therefore become an important aspect of SVM research. Researchers have put forward a number of SVM training algorithms for large-scale training sets, such as the chunking algorithm, decomposition methods, SVMlight, SMO, and GSMO. At the same time, many variants derived from SVM have appeared, such as C-SVM, v-SVM, and LS-SVM.

SVM has advantages that other statistical learning techniques cannot match, because for many problems it finds the global optimal solution.
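The margin-maximization idea behind SVM can be illustrated with a minimal sketch (not the solvers discussed above): a linear SVM trained by sub-gradient descent on the regularized hinge loss. The regularization term penalizes model capacity, which is the Structural Risk Minimization idea of trading empirical error against complexity. The toy data and hyper-parameters are invented for illustration.

```python
# Minimal linear-SVM sketch: sub-gradient descent on the regularized
# hinge loss  lam/2*||w||^2 + max(0, 1 - y*(w.x + b)).
# The lam*||w||^2 term controls capacity (structural risk); the hinge
# term measures empirical risk on the training set.

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """y in {-1, +1}; returns weight vector w and bias b."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:   # point inside the margin: hinge is active
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:            # only the regularizer pulls on w
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b

# Two linearly separable clusters in the plane (illustrative data).
X = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
     [-1.0, -1.0], [-1.5, -2.0], [-2.0, -1.5]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else -1
               for xi in X]
```

On separable data like this, the learned hyperplane classifies all training points correctly; a full SVM solver would additionally find the exact maximum-margin solution of the QP.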
However, when SVM is applied to large training sets in practice, problems such as computation speed and storage capacity remain. The convergence rate of the training algorithm and the memory required for computation have become bottlenecks in the development of SVM, so designing faster and more efficient algorithms has become a main goal of SVM research.

As with other machine learning methods, the computational complexity and training time of SVM grow non-linearly with the number of samples and the dimension of the input space. Therefore, in addition to developing faster and more efficient algorithms, reasonable pre-processing of the training set is also an important way to improve the performance of SVM, and effective feature selection is an important aspect of pre-processing. Selecting good features reasonably and effectively, and reducing the feature dimension appropriately, can on the one hand eliminate redundancy, speed up computation, and improve classification efficiency; on the other hand, it can reduce the complexity of the classifier, leading to a lower classification error rate.

In this paper, based on a study of the related theories of SVM and feature selection, we investigated a particular feature selection algorithm, Recursive Feature Elimination based on SVM (SVM-RFE), and applied it to three new application areas: blood biochemical data of seamen, clinical data of coronary heart disease, and clinical data of IgA nephropathy. The main goal of this paper is to apply SVM to design and implement classifiers and feature selection methods for these three kinds of medical data, in order to extract the important features. The work toward this goal was done independently.

The main research achievements and conclusions of this paper are as follows:

1. The selection and generation of training sets and test sets.
The selection of training sets and test sets plays an important role not only in the training and prediction results of the classifiers, but also in the results of feature selection. In this paper, we used a kd-tree method to generate Mi (i = 1, 2, 3, for the biochemical data of seamen, the clinical data of coronary heart disease, and IgA nephropathy; Mi is 300, 1000, and 1000, respectively) pairs of training sets and test sets. The kd-tree method overcomes the blindness of random sampling: it partitions the instances into groups according to their similarity and then draws samples randomly from each group. Training samples drawn this way are more representative than samples drawn purely at random. On the one hand, the number of random selections is reduced; on the other, the efficiency of the classifiers and the effectiveness of feature selection are improved.

2. Choosing parameters for the SVM classifiers.

Before the SVM classifiers are trained, it is necessary to choose the penalty parameter C and the parameter γ of the Gaussian kernel function. Different values of C and γ may affect the prediction accuracy of the classifiers, and under normal circumstances it is not easy to determine them. Therefore, in this paper, we use the 5-fold cross-validation procedure in the libsvm package, implemented by Associate Professor Chih-Jen Lin at National Taiwan University, to choose suitable values for C and γ.

3. The selection and pre-processing of features.

This article outlines some issues related to feature selection and details a particular selection method, recursive feature elimination (RFE). Using the training sets generated by the kd-tree method, the SVM-RFE algorithm then selected the important features for the biochemical data of seamen, the clinical data of coronary heart disease, and IgA nephropathy.

The output of the SVM-RFE algorithm is a list of features ranked according to their importance. In this paper, the feature selection process of SVM-RFE consists of three steps. First, a ranking list of features and its corresponding optimal subset of features are generated for each training set, yielding Mi ranking lists and Mi optimal subsets for the Mi training sets. Second, the frequency with which each feature appears in the Mi optimal subsets is counted, and the features are re-ranked by frequency to produce a new ranking list. Finally, an optimal subset of features is re-selected according to the new ranking list; this subset is the final result of feature selection in this paper. The feature selection process not only selected the most important features for the three data sets but also improved the prediction accuracy of the classifiers.

4. SVM classifiers.

This article provides an overview of the main contents of statistical learning theory and discusses support vector machines with reference to the related literature. In this paper, the support vector machine serves two functions. First, it trains the classifiers and sets up the classification models. Second, it acts as the evaluation criterion of RFE. Because RFE is a feature selection method of the wrapper type, it needs a pre-defined classifier to evaluate the goodness of the selected features.
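The core SVM-RFE loop can be sketched compactly (an illustration, not the thesis code): repeatedly fit a linear classifier, rank each remaining feature by its squared weight w_i², and eliminate the lowest-ranked feature. Here a tiny hinge-loss trainer stands in for a full SVM solver, and the data set, with one informative feature and two noise features, is an invented example.

```python
# SVM-RFE sketch: backward elimination driven by the squared weights
# of a linear classifier (a stand-in for a trained linear SVM).

def fit_linear(X, y, lam=0.01, lr=0.1, epochs=200):
    """Hinge-loss sub-gradient fit; returns the weight vector."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * sum(wj * xj for wj, xj in zip(w, xi)) < 1:
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
            else:
                w = [wj * (1 - lr * lam) for wj in w]
    return w

def svm_rfe(X, y):
    """Return feature indices ordered from least to most important."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        # Refit on the surviving features only.
        Xs = [[xi[j] for j in remaining] for xi in X]
        w = fit_linear(Xs, y)
        # Drop the feature with the smallest squared weight.
        worst = min(range(len(remaining)), key=lambda k: w[k] ** 2)
        eliminated.append(remaining.pop(worst))
    return eliminated  # last entries are the most important features

# Feature 0 carries the label; features 1 and 2 are noise (toy data).
X = [[ 2.0,  0.1, -0.2], [ 1.5, -0.3,  0.1], [ 1.8,  0.2,  0.3],
     [-2.0,  0.2,  0.1], [-1.5, -0.1, -0.3], [-1.8,  0.1,  0.2]]
y = [1, 1, 1, -1, -1, -1]
ranking = svm_rfe(X, y)
```

In the thesis's setting this elimination loop runs once per training set, and the per-set rankings are then aggregated by frequency as described in step 3 above.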
SVM therefore serves as the pre-defined classifier that helps RFE select the optimal subset of features, achieving the goal of data analysis in this paper.

5. Analysis of experimental results.

To assess the validity of the experimental results, we used the prediction accuracy, sensitivity, and specificity of a classifier as evaluation indicators, and compared the values of the three indicators before and after feature selection by SVM-RFE. The experimental results show that the average prediction accuracy, sensitivity, and specificity of the Mi SVMs for the three types of data all improved to varying degrees. In addition, compared with the T-test, a traditional statistical analysis method, the optimal subset of features selected by SVM-RFE is much smaller than the subset selected by the T-test, and the selected features gave the classifiers better classification performance. SVM-RFE is superior to the T-test in selecting relevant features, removing irrelevant and/or redundant features, and improving the performance of the classifiers.

6. Analysis of clinical significance.

For the biochemical data of seamen, the clinical data of coronary heart disease, and IgA nephropathy, we selected the 10, 6, and 1 most important features, respectively. For the ten most important features of the seamen's biochemical data, a literature search suggested that long-term ocean voyages may affect the normal liver and kidney function of the crew, and that the probability that seamen suffer from depression, diabetes, or malnutrition may be higher than for people living on land.
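The three evaluation indicators used above follow directly from the confusion matrix: accuracy = (TP+TN)/N, sensitivity = TP/(TP+FN), specificity = TN/(TN+FP). A minimal sketch, with invented labels rather than the thesis data:

```python
# Compute accuracy, sensitivity, and specificity from paired
# actual/predicted label lists (binary classification).

def evaluate(actual, predicted, positive=1):
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    tn = sum(1 for a, p in zip(actual, predicted) if a != positive and p != positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    accuracy = (tp + tn) / len(actual)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity

# Illustrative labels: 4 positives (one missed), 6 negatives (one false alarm).
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
acc, sens, spec = evaluate(actual, predicted)
```

Comparing these three values on the same test sets before and after SVM-RFE is exactly the comparison reported in section 5.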
Keywords/Search Tags: Support Vector Machine, Feature Selection, SVM-RFE, Recursive Feature Elimination, Medical Data Analysis