
Research On Some Issues In Support Vector Machines

Posted on: 2020-01-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Zhu
GTID: 1488306512481784
Subject: Control Science and Engineering
Abstract/Summary:
The idea of the support vector machine (SVM) stems from the work of Vladimir N. Vapnik and Alexander Y. Lerner in 1963. Over the past few decades, SVM has become one of the most popular algorithms in machine learning, data mining, and pattern recognition. It has been extended from binary classification and regression to many active topics, such as feature selection, semi-supervised learning, top-rank learning, ordinal regression, anomaly detection, and multi-view learning. The models in these new topics inherit most of SVM's characteristics, such as margin theory, the kernel trick, and structural risk minimization, but they also inherit SVM's drawbacks. Overcoming these drawbacks has long been a research direction in machine learning, and such research is important for SVM and its variants. This dissertation focuses on speeding up SVM training and on handling label noise in the training set. The main research and innovations are as follows:

(1) For heuristic methods that preserve useful samples in support vector classification (SVC), a new method is proposed to retain the samples located near the separating hyperplane. These samples are found by an extended nearest neighbor chain. In this way, the samples likely to become support vectors are preserved, while most of the samples that will not become support vectors are discarded. Unlike previous methods, this approach does not need to assume that overlapping regions exist between the classes. Experimental results on several datasets show that it preserves fewer samples than previous methods, while the time complexity does not increase and the classification accuracy does not degrade seriously.

(2) For heuristic methods that preserve useful samples in support vector regression (SVR), we first prove that the samples that become support vectors lie on or outside the ε-tube; second, we prove that such samples also lie near the boundary of the data distribution. The samples near the boundary are then found by the difference of nearest-neighbor distributions, where the nearest neighbors are searched within a candidate set determined in the output (label) space. Since the label is one-dimensional, determining the candidate set takes O(1) time. When the numbers of samples with different labels are close to each other, the size of the candidate set is independent of the size of the training set and depends only on the parameter k. Therefore, our method for finding critical samples for SVR runs in O(n) time. On the Year Prediction subset of the Million Song Dataset (MSD), which contains 463,715 samples, the method finishes in 10 seconds, and performance does not degrade seriously even though only 1% of the training set is preserved.

(3) For support vector classification, the samples are weighted by the ratio of the margin width to the distance between the sample and the decision hyperplane. Samples near the separating hyperplane are therefore assigned higher weights, which is consistent with the fact that such samples are more important than others in SVC (a sketch of this weighting is given after this summary). Experimental results on several datasets show that the classification accuracy of our method is better than that of SVM and of density-margin support vector machines.

(4) For the one-class support vector machine (OC-SVM) and multi-class supervised novelty detection (Multi-class SND), an extension of OC-SVM, an unsupervised outlier detection method is incorporated to handle label noise in the training set. A label-noise sample is one annotated with a wrong label; it is regarded as an outlier with respect to its annotated class and is assigned a lower weight by the unsupervised outlier detection method (see the second sketch below). By combining unsupervised outlier detection with them, both OC-SVM and Multi-class SND become robust to label noise.
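The two sketches below illustrate the weighting schemes mentioned in (3) and (4). They are minimal reconstructions under stated assumptions, not the dissertation's actual code; the function names and the use of scikit-learn are illustrative.

For (3), assuming a linear SVM with decision function f(x) = w·x + b, the margin width is 2/||w|| and a sample's distance to the hyperplane is |f(x)|/||w||, so their ratio reduces to 2/|f(x)|:

    import numpy as np
    from sklearn.svm import SVC

    def margin_distance_weights(X, y, eps=1e-6):
        # Hypothetical sketch: fit an unweighted linear SVM first, then
        # weight each sample by (margin width) / (distance to hyperplane).
        base = SVC(kernel="linear").fit(X, y)
        w = base.coef_.ravel()
        b = base.intercept_[0]
        margin_width = 2.0 / np.linalg.norm(w)          # geometric margin
        dist = np.abs(X @ w + b) / np.linalg.norm(w)    # distance to hyperplane
        return margin_width / np.maximum(dist, eps)     # large near the hyperplane

    # The weights can feed a second, weighted training pass, e.g.:
    # SVC(kernel="linear").fit(X, y, sample_weight=margin_distance_weights(X, y))

For (4), the idea is that a sample which looks like an outlier with respect to its annotated class is probably label noise and should be down-weighted before OC-SVM or Multi-class SND is trained. The sketch uses scikit-learn's LocalOutlierFactor as a stand-in for whatever unsupervised outlier detector is actually employed, and assumes each class has more than one sample:

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    def label_noise_weights(X, y, k=20):
        # Down-weight samples that are outliers w.r.t. their annotated class.
        weights = np.ones(len(y))
        for c in np.unique(y):
            idx = np.where(y == c)[0]
            lof = LocalOutlierFactor(n_neighbors=min(k, len(idx) - 1))
            lof.fit(X[idx])
            # negative_outlier_factor_ is about -1 for inliers, << -1 for outliers
            score = -lof.negative_outlier_factor_
            weights[idx] = 1.0 / score                  # label noise gets weight < 1
        return weights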
Keywords/Search Tags: machine learning, support vector machine, sample reduction, label noise, outlier detection, extended nearest neighbor chain, difference of the nearest neighbors' distributions