
Studies Of Some Problems In Support Vector Machines And Semi-supervised Learning

Posted on: 2010-08-08    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Z X Xue    Full Text: PDF
GTID: 1118360302469450    Subject: Applied Mathematics
Abstract/Summary:
With the rapid development of information technology, the data sets that must be collected and processed have become larger and larger, and the composition of data samples has become more and more complicated. As a result, machine learning has received increasing attention and become one of the hot topics of research. Statistical Learning Theory (SLT), proposed by Vapnik, provides a theoretical basis for machine learning. SLT mainly concerns the statistical laws and learning properties of learning from limited samples and, by using the principle of Structural Risk Minimization (SRM), can effectively improve the generalization ability of an algorithm. As the latest development of SLT, the Support Vector Machine (SVM) has many advantages, such as global optimization, excellent adaptability and generalization ability, and sparsity of the solution. It can handle many practical application problems, such as small samples, nonlinear learning, overfitting, the curse of dimensionality, and local minima, and is a new milestone in the field of machine learning. SVM has therefore been widely used in pattern recognition, regression estimation, function approximation, density estimation, etc. Recently, inspired by the above advantages of SVM, researchers have proposed extended SVM algorithms, including Least Squares Support Vector Machines (LSSVM), the Center Support Vector Machine (CSVM), Hypersphere Support Vector Machines (also called Support Vector Domain Description, SVDD), and Sphere-based Pattern Classification (SSPC). These algorithms improve and complement SVM from different aspects. In many machine learning problems, a large amount of data is available, but only a few samples can be labeled easily; the remaining, relatively large amount of data cannot be labeled for various reasons (labels are hard or expensive to obtain).
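To make the SVM objective concrete, the following is a minimal illustrative sketch (not any algorithm from this thesis) of training a soft-margin linear SVM by subgradient descent on the regularized hinge loss; the function name, learning rate, and epoch count are all assumptions chosen for the example.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))
    by plain subgradient descent (labels y_i in {-1, +1})."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                      # samples violating the margin
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

In practice the dual quadratic program (or a dedicated solver) is used instead, which is what yields the sparse support-vector solution mentioned above; this primal sketch only shows the objective being minimized.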
The problem of combining unlabeled and labeled data to learn the labels of the unlabeled ones is called semi-supervised learning. This thesis focuses on some problems in SVM, several extensions of SVM, and semi-supervised learning. The main contributions of the thesis are as follows:

1. We study how to improve the learning speed and classification accuracy of SVM on large-scale sample sets. SVM takes a very long time to train when the training set is large, and its classification precision is easily affected by outliers, so we propose an SVM algorithm based on hull vectors and center vectors. First, we find the convex hull vectors and the center vector of each class. Second, the convex hull vectors are used as the new training samples to train a standard SVM and obtain the normal vector of the separating hyperplane. Finally, to weaken the influence of outliers, we use the center vectors to update the normal vector and obtain the final classifier. Experiments show that this learning strategy not only speeds up training but also improves classification accuracy.

2. We study the imbalanced-data classification problem for two variants of SVM, namely LSSVM and SSPC. For LSSVM on imbalanced data, we take both the number of samples and the degree of dispersion of each class into consideration and adjust the separating hyperplane of standard LSSVM. This overcomes the disadvantage of traditional designs, which consider only the imbalance in sample sizes, and improves the generalization ability of LSSVM. For SSPC, we provide two parameters that control the upper bounds on the error rates of the two classes separately. In this way, classification and prediction performance on imbalanced data sets can be improved, and the range for parameter selection can be greatly narrowed.
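The preprocessing idea behind contribution 1 — replacing each class by its convex hull vertices plus a class center — can be sketched as follows for the two-dimensional case. This is an illustrative reconstruction, not the thesis's actual procedure: the hull is computed with Andrew's monotone chain (a standard 2-D convex hull algorithm), and `reduce_class` is a hypothetical helper name.

```python
import numpy as np

def convex_hull_2d(points):
    """Andrew's monotone chain: return hull vertices in CCW order."""
    pts = sorted(map(tuple, points))
    if len(pts) <= 2:
        return [np.array(p) for p in pts]
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:                               # build lower hull
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):                     # build upper hull
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return [np.array(p) for p in lower[:-1] + upper[:-1]]

def reduce_class(X):
    """Keep only the hull vertices of one class; also return its center."""
    hull = np.array(convex_hull_2d(X))
    center = X.mean(axis=0)
    return hull, center
```

Interior points are discarded, so the SVM is trained on far fewer samples — the intuition being that the separating hyperplane between two classes is determined by boundary points, while the center vector summarizes the class for the later outlier-weakening step.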
Experimental results show that these methods effectively enhance classification performance on imbalanced data sets.

3. We study transductive learning, a branch of semi-supervised learning, in the following two ways. First, the progressive transductive support vector machine (PTSVM) proposed by Chen has obvious deficiencies, such as slow training, many backward-learning steps, and unstable learning performance. To overcome these shortcomings, we give two improved progressive transductive support vector machine algorithms. They inherit PTSVM's progressive labeling and dynamic adjustment, use the information of support vectors or reliability values to select new unlabeled samples to label, and combine incremental support vector machines or a pre-extracting support vector algorithm to reduce computational complexity. Experimental results show that the proposed algorithms obtain satisfactory learning performance. Second, we propose transductive learning strategies for an extended SVM algorithm, SSPC. The proposed algorithms seek a hypersphere that separates the data with maximum separation ratio and construct the classifier using both the labeled and unlabeled data. This method exploits the additional information in the unlabeled samples and obtains better classification performance when insufficient labeled data are available. Experimental results show that the proposed algorithm yields better performance.

4. We study semi-supervised outlier detection (SSOD) in the situation where few labeled data and a wealth of unlabeled data are available. Outlier detection has always been a difficult task. In many applications, such as network intrusion detection, fraud detection, and medical diagnosis, outliers that deviate significantly from the majority of samples are more interesting and useful than the common samples.
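The progressive-labeling loop underlying transductive schemes like PTSVM can be illustrated with a generic self-training sketch: repeatedly fit a classifier on the current labeled pool, then move the single most confident unlabeled point into the pool with its predicted label. This sketch uses an ordinary least-squares linear classifier in place of an SVM, and has none of PTSVM's backward-learning or the thesis's support-vector/reliability selection rules; the function name and parameters are assumptions.

```python
import numpy as np

def progressive_label(X_l, y_l, X_u, rounds=3):
    """Self-training sketch: each round, fit a least-squares linear
    classifier on the labeled pool, then transfer the unlabeled point
    with the largest |decision value| (most confident prediction)."""
    X_l = [np.asarray(x, dtype=float) for x in X_l]
    y_l = list(y_l)
    X_u = [np.asarray(x, dtype=float) for x in X_u]
    for _ in range(rounds):
        if not X_u:
            break
        A = np.hstack([np.array(X_l), np.ones((len(X_l), 1))])
        w, *_ = np.linalg.lstsq(A, np.array(y_l, dtype=float), rcond=None)
        f = [float(np.dot(w[:-1], x) + w[-1]) for x in X_u]
        i = int(np.argmax(np.abs(f)))           # most confident point
        X_l.append(X_u.pop(i))
        y_l.append(1.0 if f[i] >= 0 else -1.0)
    return X_l, y_l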
We propose Fuzzy Rough Semi-Supervised Outlier Detection (FRSSOD), which applies the theory of rough and fuzzy sets to SSOD. With the help of a few labeled samples and a fuzzy rough C-means clustering algorithm, the method introduces an objective function that minimizes the sum of squared clustering errors, the deviation from the known labeled examples, and the number of outliers. Each cluster is represented by a center, a crisp lower approximation, and a fuzzy boundary, and only points located in the boundary are further examined as possible outliers. Experimental results show that, on average, the proposed method maintains or improves detection precision, reduces the false-alarm rate, and reduces the number of candidate outliers that need further examination.
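The lower-approximation/boundary split can be caricatured in a few lines: given cluster centers, a point close enough to its nearest center belongs to a crisp lower approximation and is kept, a sample labeled as normal is always kept, and everything else falls in the boundary and becomes an outlier candidate. This toy sketch omits the fuzzy memberships and the joint objective of FRSSOD entirely; the threshold `tau` and the function name are assumptions.

```python
import numpy as np

def flag_outliers(X, centers, labeled_normal, tau):
    """Toy sketch of the boundary idea: points within distance tau of
    their nearest cluster center (or labeled normal) are kept; the rest
    are boundary points, returned as outlier candidates."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).min(axis=1)
    return (d > tau) & ~labeled_normal
```

Restricting the search to boundary points is what shrinks the candidate set that a human (or a second-stage test) must examine, which matches the reduction in candidate outliers reported above.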
Keywords/Search Tags: Statistical learning theory, Support vector machines, Least squares support vector machines, Hypersphere support vector machines, Large-scale sample sets, Imbalanced classification, Semi-supervised learning, Transductive learning, Outlier detection