
Research On Outlier Detection Method And Its Key Techniques

Posted on: 2014-11-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B Chen
Full Text: PDF
GTID: 1268330422952651
Subject: Computer application technology
Abstract/Summary:
Outlier detection aims to discover abnormal data patterns in observed data that do not conform to normal (expected) behavior. Depending on the application, these abnormal patterns are called outliers, inconsistent points, novelties, or stains. In recent years, outlier detection has been widely applied in fault diagnosis, disease detection, intrusion detection, credit card (or insurance) fraud detection, and person identification. In these areas, an abnormal pattern often implies significant (usually very harmful, even deadly) behavior. For instance, abnormal traffic (behavior) on the Internet may imply the leakage of sensitive information from an attacked host, and credit card fraud can lead to great economic loss. Because of its great practical significance and value, outlier detection has become a very active research area that attracts close attention from many researchers.

Unlike other learning tasks, an outlier detection task typically has only data patterns that conform to expected behavior (the target class) and few, or even no, data patterns that do not (the outlier class). The resulting extreme class imbalance (outlier samples are far fewer than target samples) makes outlier detection very difficult. Recent research has therefore mainly focused on the unsupervised learning framework and on supervised learning methods that use only a few labeled outlier samples. Based on an in-depth study of the principles of various outlier detection methods, robustness to outliers, and the embedding of prior knowledge, the contributions of this dissertation are as follows:

1. First, One-cluster Clustering based Data Description (OCCDD) is proposed. It employs the PCM (Possibilistic C-Means) algorithm with one cluster, that is, P1M (PCM, C=1), to compute the weights, and then obtains an enclosing ball by weighted averaging.
As a result, OCCDD avoids the sensitivity to outliers and the high training complexity that Support Vector Data Description (SVDD) incurs due to its minimax optimization. Second, P1M is proved in theory to have a global optimum property that the original PCM (C>1) lacks. Finally, a multiview OCCDD is proposed to exploit the intrinsic multiview property of text classification. Unlike general classifiers, which learn from a single view, multiview OCCDD learns from all views simultaneously and improves performance because the views boost one another.

2. An SVDD regularized with the Area Under the ROC Curve (AUC) is proposed for the situation in which outliers lie around the target samples. The regularized SVDD incorporates the AUC measure into the optimization objective of SVDD and simultaneously optimizes the volume of the minimum enclosing ball and the AUC performance, so as to deal with the extreme imbalance in the class distribution. Two speed-up techniques are then proposed to address the high training complexity introduced by AUC regularization.

3. A design framework for manifold-based classifiers, mXXX≈ISOMAP+XXX (where XXX denotes an existing learning algorithm based on Euclidean distance), is proposed. It replaces the Euclidean distance in the feature space obtained after ISOMAP dimension reduction with the geodesic distance in the input space, and thus implicitly conducts ISOMAP without actually running the ISOMAP process. When the observed data lie on an underlying manifold, SVDD performance degrades because Euclidean distance cannot depict the true geometric structure, so we extend this framework to SVDD and derive an SVDD with Manifold Embedding (mSVDD).
After manifold embedding, mSVDD has the following advantages: (1) by approximating the Euclidean distances in the feature space induced by the ISOMAP process, it solves the problem that geodesic-distance-based SVDD cannot be optimized directly; (2) it avoids the actual Multidimensional Scaling (MDS) step of ISOMAP and the selection of the dimension of the Euclidean space after ISOMAP; (3) unlike the conventional Euclidean-distance-based SVDD, mSVDD is based on geodesic distance and implicitly executes an ISOMAP process, so it can find a manifold embedding.

4. The relationship between density estimation and domain-based outlier detectors is revealed, in particular the essential relation between kernel density estimation and two domain-based outlier detectors induced by the Gaussian kernel, One-Class Support Vector Machine (OCSVM) and SVDD. That is, domain-based outlier detectors fall within the framework of density estimation. Moreover, the density estimator induced by OCSVM and SVDD is consistent with the true density, and optimizing OCSVM and SVDD also reduces the Integrated Squared Error (ISE).
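
The weighting idea behind contribution 1 can be illustrated with a minimal sketch. It is not the dissertation's OCCDD implementation: it only shows the standard possibilistic c-means update restricted to one cluster, where each point receives a typicality weight and the center is the weighted average of the points. The function names (`p1m`, `fit_ball`, `is_outlier`), the fixed choice of the scale `eta`, and the use of `sqrt(eta)` as the ball radius (the distance at which typicality falls to 0.5) are all assumptions made for illustration.

```python
import math


def p1m(points, m=2.0, n_iter=100, tol=1e-9):
    """One-cluster possibilistic c-means sketch: returns (center, weights, eta).

    m is the fuzzifier and must be greater than 1.
    """
    n, dim = len(points), len(points[0])
    # Start from the unweighted mean, which outliers can pull away.
    center = [sum(p[k] for p in points) / n for k in range(dim)]
    # Assumption: eta is fixed to the mean squared distance to that start.
    eta = sum(math.dist(p, center) ** 2 for p in points) / n or 1.0
    weights = [1.0] * n
    for _ in range(n_iter):
        # Typicality: near 1 close to the center, near 0 far from it.
        weights = [
            1.0 / (1.0 + (math.dist(p, center) ** 2 / eta) ** (1.0 / (m - 1.0)))
            for p in points
        ]
        wm = [w ** m for w in weights]
        total = sum(wm)
        # New center is the weighted average of the points.
        new_center = [
            sum(w * p[k] for w, p in zip(wm, points)) / total for k in range(dim)
        ]
        shift = math.dist(new_center, center)
        center = new_center
        if shift < tol:
            break
    return center, weights, eta


def fit_ball(points, m=2.0):
    """Return (center, radius, weights) of a hypothetical enclosing ball."""
    center, weights, eta = p1m(points, m=m)
    # Assumption: radius is the distance at which typicality equals 0.5.
    return center, math.sqrt(eta), weights


def is_outlier(x, center, radius):
    """Flag a point that falls outside the ball."""
    return math.dist(x, center) > radius
```

Because a far-away point receives a small weight, it contributes little to the weighted average, so the center stays near the bulk of the target samples instead of drifting toward the outlier as a plain mean would.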
Keywords/Search Tags: outlier detection, support vector data description, robustness, weighted averaging, possibilistic C-means, multiview learning, AUC metric, manifold embedding