
Research On Outlier Detection Based On Support Vector Machines

Posted on: 2010-09-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Tian
Full Text: PDF
GTID: 1118360302460484
Subject: Control theory and control engineering
Abstract/Summary:
Outlier detection is the problem of finding patterns in data that do not conform to expected behavior. These nonconforming patterns often carry potentially useful information. Outlier detection is one of the most important topics in data mining and is used in a wide variety of applications, such as credit fraud detection, fault detection, health care, intrusion detection for network security, and image retrieval. In recent years, researchers in China and abroad have focused on applying Support Vector Machine (SVM) theory to outlier detection, and many results have been obtained. As research and applications have progressed, existing methods and techniques have run into difficulties with the generalization ability and robust stability of outlier detection models. Motivated by these observations, this dissertation focuses on SVM methods and seeks new techniques for efficient and robust SVM-based outlier detection. It covers:
1. Research on semi-supervised and unsupervised outlier detection methods based on One-Class SVM (OCSVM). In practice, the availability of labeled data for training and validating outlier detection models is a major issue, because databases contain only a few labeled outliers. One-class classification techniques are promising for detecting new outliers. However, such techniques usually achieve a high detection rate at the cost of a high false positive rate, because proper parameters are difficult to select and because the choice of the origin as the separation point is arbitrary and affects the decision boundary returned by the algorithm. A new model is proposed that makes use of receiver operating characteristic (ROC) analysis: the optimal parameters are searched automatically within a limited scope using two techniques, and the detection decision function is then obtained after a boundary movement process.
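The ROC-guided selection step described above could look like the following minimal sketch. It assumes anomaly scores are already available (higher means more outlying) together with a small labeled validation set, and it picks the cut-off that maximizes TPR minus FPR (Youden's J statistic). That criterion, the function names `roc_points` and `best_threshold`, and the toy data are illustrative assumptions; the abstract does not state the search criteria or the boundary movement rule.

```python
def roc_points(scores, labels):
    """Return (threshold, fpr, tpr) triples, one per distinct score.

    scores: higher means more outlying (assumed convention).
    labels: 1 for a known outlier, 0 for a normal sample.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        # A sample is flagged as an outlier when its score reaches the threshold.
        flagged = [s >= thr for s in scores]
        tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
        fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
        points.append((thr, fp / n_neg, tp / n_pos))
    return points


def best_threshold(scores, labels):
    """Pick the ROC point maximizing TPR - FPR (Youden's J, an assumed criterion)."""
    return max(roc_points(scores, labels), key=lambda p: p[2] - p[1])


# Toy validation set: three labeled outliers, four normal samples (hypothetical data).
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1, 0.05]
labels = [1, 1, 1, 0, 0, 0, 0]
thr, fpr, tpr = best_threshold(scores, labels)
```

Moving the cut-off along the ROC curve in this way trades detection rate against false positives, which is the tension the paragraph above attributes to one-class classifiers.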
To identify the ideal hyperplane, a new algorithm named "local density OCSVM" is proposed, which incorporates a distance-based local density degree to reflect the overall characteristics of the target data. Finally, an "Outlier OCSVM" is proposed and a framework is designed for unsupervised outlier detection. Two definitions of outlier degree are presented, scored respectively by the distance from the hyperplane and by the probabilistic output value. After some suspicious outliers are picked out by combining the two outlier-degree criteria, the model starts training, and the two parts of the data set are updated interactively by comparing the outputs.
2. Research on robust classification models that combine data preprocessing techniques with SVMs in outlier detection. Experimental data sets are likely to contain outliers or noise, which can lead to poor generalization ability and classification accuracy for SVMs. This happens because the outliers may become boundary support vectors and contribute to the decision function. In addition, high-dimensional feature databases can reduce efficiency and performance. A method using Weighted SVM (WSVM) combined with Principal Component Analysis (PCA) is then proposed for robust prediction of protein subcellular localization. After dimension reduction is performed on the data sets, more suitable weights are generated for further training, since PCA transforms the data into a new coordinate system whose directions of largest variance are strongly affected by the outliers. The Gaussian process latent variable model (GPLVM) is also used for nonlinear low-dimensional embedding of the sample data sets, and a new ladder jumping dimensional reduction classification framework is proposed for effectively determining the target dimension.
3. Research on hybrid methods for solving imbalanced classification problems in outlier detection.
The data sets used in outlier detection applications are usually imbalanced, which has a detrimental effect on the performance of an SVM classifier, because the classifier may be strongly biased towards the majority class. A new resampling algorithm based on a modified OCSVM is then proposed, and a two-stage outlier detection approach is designed by combining the resampling algorithm with a cost-sensitive SVM. Low weights are set for outliers, and some common points are removed proportionally by the hyperplane in feature space, which also helps overcome the effect of overlapping data points. The optimal parameters of the cost-sensitive SVM are searched and the cumulative misclassification costs are reduced. Moreover, a new method based on ensemble learning is proposed. Both the minority and majority classes are resampled to increase the generalization ability. For the majority class, the prototypes of the clusters are selected instead of all the data, which in essence amounts to undersampling this class. The clusters are used to build an SVM ensemble with the oversampled minority patterns. For the minority class, an OCSVM model combined with the synthetic minority oversampling technique (SMOTE) is used to oversample the support vector instances. The hybrid methods adopt both strategies, modifying the data distribution and adjusting the classifier, and achieve high true positive rates with low false positive rates.
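As a rough illustration of the oversampling idea mentioned above, the sketch below shows only the generic SMOTE interpolation step: each synthetic point is placed on the segment between a minority sample and one of its nearest minority neighbours. The dissertation applies SMOTE to support vector instances selected by an OCSVM; that selection is not reproduced here, and the function name `smote_like`, the parameters `k` and `seed`, and the toy points are assumptions made for this example.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of base among the other minority samples
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, nb)))
    return synthetic


# Four hypothetical minority-class points on the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=5)
```

Because every synthetic point is an interpolation between two existing minority samples, the new points stay inside the region already occupied by the minority class rather than duplicating existing samples.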
Keywords/Search Tags:Support Vector Machines, Outlier detection, One-Class Classification, Imbalanced Classification, Kernel Method