As one of the hot topics in data mining, outlier detection, a.k.a. anomaly detection, aims to discover objects with anomalous behavior within the original data distribution. It can be applied to abnormal crowd behavior detection, credit fraud, intrusion detection, health care, the Internet of Things (IoT), etc. Outlier detection faces two main challenges: data dimensionality and data scale. Centralized computing usually confronts the "curse of dimensionality" with respect to the data itself, and as the data scale increases, the runtime of computation on a single node becomes unacceptable. The core work of this dissertation is to solve the issues of high dimensionality and large scale in outlier detection; within it, the performance evaluation of distributed outlier detection methods and one-class techniques is studied. In particular, for several proximity-based outlier detection approaches (unsupervised learning methods), we propose dedicated distributed models, some combined with a dimensionality reduction technique. For outlier detection methods based on the one-class support vector machine (semi-supervised learning methods), we propose a hybrid detection model combined with a dimensionality reduction technique. Specifically, this dissertation conducts research on distributed outlier detection from four aspects:
(1) For proximity-based distributed outlier detection algorithms, this dissertation proposes the new statistical concepts of "extended variance" and "grid clustering degree" as quantitative evaluations of algorithm performance, and proves the mathematical properties of "extended variance". The distributed proximity-based method partitions the data space into grids and then adopts an allocation algorithm to assign the data objects in each grid to data nodes with different performance. Through analysis, we conclude that "extended variance" and "grid clustering degree" can be used to measure
the allocation algorithm's balance of data allocation and to quantify the network load generated by the model in the proximity-based outlier detection problem.
(2) Aiming at the high computation time caused by the growth of data scale in outlier detection, this dissertation designs a data allocation algorithm that can be deployed on clusters whose nodes have different performance, and designs computation models separately for the two typical proximity-based methods (density-based and distance-based). Both models can use multiple computers with different performance to accelerate the computation of outliers, so they are more flexible. Finally, experiments demonstrate the effectiveness and robustness of the proposed methods.
(3) This dissertation combines a dimensionality reduction technique, a distributed technique, and the local outlier probability (LoOP) method to propose a novel model, which effectively addresses the problems of data dimensionality and data scale. A common issue, the "curse of dimensionality", usually exists in proximity-based outlier detection, so reducing data dimensionality is a core point of current research. This dissertation conducts several experiments, which demonstrate that a stacked autoencoder (SAE) can capture the features of the original data well. While improving the time efficiency of the detection algorithm, the AUC and recall of the detection algorithm remain acceptable; on some datasets, AUC and recall even improve, since the SAE's dimensionality reduction removes redundant information from the given dataset.
(4) To address semi-supervised outlier detection, this
dissertation proposes a semi-supervised hybrid model, which contains two key components. First, an SAE is trained in an unsupervised manner to extract common features; then multiple subsets are randomly sampled from the training dataset, and each subset is used to train a one-class support vector machine (OC-SVM) outlier detector. Since this dissertation distributes each outlier detector to a different data node, this process reduces the amount of computation per node. In the end, the test dataset is fed into every outlier detector to obtain a joint judgment result. The experiments demonstrate that the proposed method achieves better outlier detection performance when using a stacked autoencoder, while decreasing both training and testing runtime in comparison with state-of-the-art baselines.
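As an illustrative sketch of aspects (1) and (2), the grid-to-node allocation over nodes with different performance can be written as a greedy assignment, with a capacity-normalized load variance as a balance measure. Both the allocator and the `load_variance` function below are hypothetical stand-ins written for illustration; the dissertation's actual allocation algorithm and the exact definition of "extended variance" differ.

```python
import numpy as np

def allocate_grids(grid_counts, node_capacities):
    """Greedy sketch: assign each grid cell (largest first) to the node
    with the lowest load relative to its capacity. Hypothetical."""
    caps = np.asarray(node_capacities, dtype=float)
    loads = np.zeros(len(caps))
    assignment = {}
    for g in sorted(range(len(grid_counts)), key=lambda i: -grid_counts[i]):
        node = int(np.argmin(loads / caps))   # least-loaded node, per capacity
        assignment[g] = node
        loads[node] += grid_counts[g]
    return assignment, loads

def load_variance(loads, node_capacities):
    """Variance of capacity-normalized loads: 0 means a perfectly balanced
    allocation. Only an illustrative stand-in for 'extended variance'."""
    r = np.asarray(loads, dtype=float) / np.asarray(node_capacities, dtype=float)
    return float(((r - r.mean()) ** 2).mean())
```

For example, allocating grid cells with 10, 20, 30, and 40 objects to two equal-capacity nodes yields 50 objects per node and a zero balance score.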
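The LoOP score used in aspect (3) can be sketched in a few lines. This follows the standard single-node formulation (probabilistic set distance, PLOF, and normalization through the Gaussian error function) rather than the dissertation's distributed implementation, and the parameter names are ours.

```python
import numpy as np
from math import erf, sqrt

def loop_scores(X, k=10, lam=3.0):
    """Simplified Local Outlier Probability (LoOP): returns a value in
    [0, 1] per point, where values near 1 indicate likely outliers."""
    n = len(X)
    # pairwise Euclidean distances
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    # k nearest neighbours of each point (index 0 is the point itself)
    knn = np.argsort(D, axis=1)[:, 1:k + 1]
    # probabilistic set distance: lam * sqrt(mean squared kNN distance)
    sigma = np.sqrt((D[np.arange(n)[:, None], knn] ** 2).mean(axis=1))
    pdist = lam * sigma
    # PLOF: own pdist relative to the expected pdist of the neighbourhood
    plof = pdist / pdist[knn].mean(axis=1) - 1.0
    nplof = lam * np.sqrt((plof ** 2).mean())
    # normalise to an outlier probability via the Gaussian error function
    return np.array([max(0.0, erf(p / (nplof * sqrt(2)))) for p in plof])
```

A point far from a dense cluster receives a score close to 1, while points inside the cluster stay near 0, which is what makes the score directly interpretable as an outlier probability.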
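The ensemble stage of aspect (4) can be sketched with scikit-learn's `OneClassSVM` as an assumed stand-in for the dissertation's detectors; here the detectors are trained sequentially for illustration, whereas the dissertation distributes them across data nodes, and the subset fraction and `nu` value are our assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def train_ocsvm_ensemble(X_train, n_detectors=5, subset_frac=0.6, seed=0):
    """Train one OC-SVM per randomly sampled subset of the training data."""
    rng = np.random.default_rng(seed)
    m = max(1, int(subset_frac * len(X_train)))
    detectors = []
    for _ in range(n_detectors):
        idx = rng.choice(len(X_train), size=m, replace=False)
        detectors.append(
            OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_train[idx])
        )
    return detectors

def joint_predict(detectors, X_test):
    """Majority vote over the detectors: +1 = inlier, -1 = outlier."""
    votes = np.stack([d.predict(X_test) for d in detectors])
    return np.where(votes.sum(axis=0) >= 0, 1, -1)
```

In the full model, the SAE's low-dimensional codes (not the raw features) would be fed to `train_ocsvm_ensemble`, and the vote in `joint_predict` corresponds to the joint judgment result described above.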