
Improved Support Vector Machine And Its Application

Posted on: 2013-01-12    Degree: Master    Type: Thesis
Country: China    Candidate: Y. Chen    Full Text: PDF
GTID: 2218330374971039    Subject: Bioinformatics
Abstract/Summary:
Support vector machines (SVMs), comprising support vector regression (SVR) and support vector classification (SVC), are grounded in statistical learning theory and structural risk minimization. They address the problems of nonlinearity, overfitting, and local minima, and offer strong generalization ability. In this study, a series of algorithmic improvements was developed along three lines: k-nearest-neighbor sample selection, conversion of multi-class problems into two-class problems, and high-dimensional feature selection. The results are as follows.

K-nearest-neighbor (KNN) selection of training samples. Because training samples are heterogeneous, models built on the k nearest neighbors of a test sample are often more accurate than models built on the whole training set. A KNN-based SVR can therefore speed up training while improving prediction accuracy, by exploiting SVR's suitability for small samples. However, choosing the optimal k remains an open problem: an exhaustive, one-by-one search over the training set is costly, and it yields only a single global k. In fact, because samples differ from one another, each sample should have its own optimal neighbor count; no common optimal k exists. Based on principal component analysis (PCA), geostatistics (GS), and SVR, a novel individualized forecasting method for quantitative structure-activity relationships (QSAR), Weight-PCA-GS-SVR, was proposed.
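The core idea of fitting a separate small-sample SVR to each test sample's k nearest neighbors can be sketched as follows. This is a minimal illustration only: the thesis's PCA weighting and geostatistical range selection are omitted, and the data, function name, and k value are hypothetical.

```python
# Sketch of per-sample k-nearest-neighbor SVR: for each test point, train a
# local SVR on only its k nearest training samples instead of the full set.
import numpy as np
from sklearn.svm import SVR
from sklearn.neighbors import NearestNeighbors

def knn_svr_predict(X_train, y_train, X_test, k=30):
    """Predict each test sample with an SVR fitted to its k nearest neighbors."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    preds = []
    for x in X_test:
        _, idx = nn.kneighbors(x.reshape(1, -1))        # indices of the k neighbors
        local = SVR(kernel="rbf").fit(X_train[idx[0]], y_train[idx[0]])
        preds.append(local.predict(x.reshape(1, -1))[0])
    return np.array(preds)

# toy regression data (hypothetical, not from the thesis)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + X[:, 1]
y_hat = knn_svr_predict(X[:150], y[:150], X[150:], k=30)
print(y_hat.shape)  # (50,)
```

Because each local model sees only k samples, training cost per test point stays small even when the full training set is large, which is the speed advantage the text describes.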
Its basic principles are as follows. First, PCA reduces dimensionality and eliminates redundant information among the independent descriptors. Second, principal components unrelated to the activity are removed nonlinearly using SVR. Third, weighted distances between samples are computed from the retained principal components. Fourth, a common range is determined using high-dimensional geostatistics. Last, for each test sample, the k nearest neighbors whose weighted distances fall within the common range are drawn from the training set, a model is built on them with SVR, and an individualized prediction is made. Weight-PCA-GS-SVR optimizes the model along both the column direction (descriptors) and the row direction (samples), while retaining all the advantages of SVR. It thus offers a new way of choosing k nearest neighbors, as well as a novel weighting scheme for the retained principal components or descriptors. On three data sets, the proposed method achieved the highest prediction precision among all reference models and showed a clear advantage over previously reported results. Weight-PCA-GS-SVR can therefore be widely applied in QSAR and other regression prediction fields.

Converting multi-class classification into two-class classification. Apart from a few simple two-class problems, most classification tasks are complicated multi-class problems, which must be transformed into two-class problems for analysis. Under the traditional "one-versus-one" scheme, k(k-1)/2 binary classifiers must be built, which is tedious; the "one-versus-rest" scheme likewise has shortcomings. Both schemes suffer from inadequate use of information and low prediction accuracy, so converting the classification model appropriately is also important.
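For reference, the conventional "one-versus-one" decomposition criticized above can be sketched as follows: k classes yield k(k-1)/2 binary SVCs whose predictions are combined by majority vote. The data and function name here are illustrative, not from the thesis.

```python
# Sketch of one-versus-one decomposition: k*(k-1)/2 pairwise binary SVCs,
# combined by majority voting over the class labels.
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

def one_vs_one_fit_predict(X_train, y_train, X_test):
    classes = np.unique(y_train)
    votes = np.zeros((len(X_test), len(classes)), dtype=int)
    for a, b in combinations(range(len(classes)), 2):
        # train one binary classifier on the two classes' samples only
        mask = np.isin(y_train, [classes[a], classes[b]])
        clf = SVC(kernel="rbf").fit(X_train[mask], y_train[mask])
        pred = clf.predict(X_test)
        for ci in (a, b):
            votes[:, ci] += (pred == classes[ci])
    return classes[votes.argmax(axis=1)]   # class with the most votes wins

# three toy Gaussian clusters (hypothetical data)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=3 * c, size=(30, 2)) for c in range(3)])
y = np.repeat(np.arange(3), 30)
pred5 = one_vs_one_fit_predict(X, y, X[:5])
print(pred5)
```

Note that each pairwise classifier ignores the remaining classes entirely, which is one form of the "inadequate use of information" the text mentions.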
Based on the principle of interaction, a new classification conversion was put forward in this study. First, the initial multi-class samples were transformed into two-class samples by an interaction transformation. Second, a symmetric kernel function was introduced to resolve the ordering problem of the two original samples in each interaction pair. Third, irrelevant and redundant features were eliminated nonlinearly with SVC, and the relative importance of the remaining features was ranked. Last, the prediction results were further corrected by a simple voting decision. Applied to identifying butterflies of seven species, the new method achieved 100% accuracy at both the species and family levels. The results show that the method can be widely used in multi-class prediction tasks, such as automatic identification of insects.

High-dimensional feature selection. Not every feature is useful for prediction; moreover, redundant features increase model complexity and reduce forecast accuracy. In theory there are 2^m possible ways to select a subset of p (p < m) optimal features from m features; this is a known NP-hard problem, and exhaustive search is infeasible for large m. The best k individual features are not necessarily the best k-feature combination, so features must be selected within a unified model. In this study, a new high-dimensional nonlinear feature selection method was proposed that evaluates each feature with a 0/1 replacement strategy, effectively unifying the selection of high-dimensional features. On this basis, a new O-glycosylation site prediction tool using only sequence information, MSCAA-OGlySite, was built. First, 9,723 sequence features were extracted by MSCC; after RM screening, 38 and 53 features were retained for the S and T sites, and the accuracies increased from 83% and 81% to 94.0% and 92.7%, respectively.
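The 2^m subset-selection problem can be made concrete with a small sketch: each candidate subset is a 0/1 mask over the m features, and each mask is scored by cross-validated SVC accuracy. This is a generic wrapper-selection illustration under assumed data, not the thesis's RM screening method, and it is only feasible here because m is tiny.

```python
# Sketch of 0/1-mask feature selection: every subset of m features is a binary
# mask; exhaustive search over all 2^m masks is only possible for small m.
from itertools import product
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def best_feature_mask(X, y, cv=3):
    m = X.shape[1]
    best_mask, best_score = None, -np.inf
    for mask in product([0, 1], repeat=m):       # 2^m candidate subsets
        if not any(mask):
            continue                             # skip the empty subset
        cols = [i for i, bit in enumerate(mask) if bit]
        score = cross_val_score(SVC(kernel="rbf"), X[:, cols], y, cv=cv).mean()
        if score > best_score:
            best_mask, best_score = mask, score
    return best_mask, best_score

# toy data: only the first two of four features carry signal (hypothetical)
rng = np.random.default_rng(2)
X = rng.normal(size=(90, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
mask, score = best_feature_mask(X, y)
print(mask, round(score, 3))
```

With m = 4 this loop evaluates only 15 non-empty masks, but at m = 9,723 (the feature count above) the 2^m search space is astronomically large, which is why a heuristic screening strategy is needed instead.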
Keywords/Search Tags: Support vector machine, k-nearest neighbors, Conversion of multi-class into two-class classification, Feature selection, Quantitative structure-activity relationship, Prediction of O-glycosylation