
Research Of Outlier Detection Algorithm Based On Hadoop

Posted on: 2016-12-08 | Degree: Master | Type: Thesis
Country: China | Candidate: Y P Guo | Full Text: PDF
GTID: 2308330482450607 | Subject: Computer software and theory
Abstract/Summary:
Outlier detection, a fundamental task in data mining, aims to discover abnormal objects that deviate from the majority of objects in a dataset. Many outlier detection algorithms have been developed, and they are widely used in manufacturing, finance, network security, and medicine. In the era of big data, however, most existing algorithms are ineffective and time-consuming when applied to massive data. Google's MapReduce offers a way to process and analyze data at this scale, and Hadoop makes it convenient for researchers to develop distributed programs for big data.

This thesis investigates how to detect global and local outliers in massive mixed-data datasets, and parallelizes the proposed serial algorithms on Hadoop. The main contributions are as follows:

(1) For global outlier detection, a nearest-neighbor-based algorithm for mixed data is proposed. The algorithm first defines a dissimilarity measure for mixed data based on neighborhood counting and then defines an outlier factor; the points with the largest outlier factor values are reported as outliers. To further improve efficiency, a parallel version is designed on Hadoop. Experiments on several real-world datasets and comparisons with other outlier detection algorithms show that the proposed algorithm detects outliers more effectively, with the merits of few parameters and high precision. The experimental results for the parallel algorithm show high efficiency and scalability on massive mixed datasets.

(2) For local outlier detection, an algorithm based on clustering and density is proposed. A density-based local outlier factor is defined, and the dataset is reduced by pruning non-isolated objects. To further improve efficiency, a parallel version is designed on Hadoop. Experiments on artificial and UCI datasets verify the effectiveness of the serial algorithm, and experiments on the Hadoop platform show that the parallel algorithm achieves good speedup and scalability.

(3) Based on Hadoop and a B/S (browser/server) architecture, an outlier detection platform is designed and implemented. It includes the two distributed outlier detection algorithms proposed in this thesis and provides a friendly graphical interface, so users can conveniently manage files, run outlier detection, and view results over the Internet.
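
The pipeline in contribution (1), that is, a dissimilarity measure for mixed data, an outlier factor, and selection of the points with the largest factors, can be illustrated with a minimal serial sketch. The abstract does not give the thesis's formulas, so everything concrete here is an assumption: a simple Gower-style dissimilarity stands in for the neighborhood-counting measure, the outlier factor is assumed to be the mean dissimilarity to the k nearest neighbors, and all function names are hypothetical.

```python
from typing import List, Sequence, Union

Value = Union[float, str]  # numeric or categorical attribute value


def attribute_ranges(data: Sequence[Sequence[Value]]) -> List[float]:
    """Range (max - min) of each numeric column; 0.0 for categorical columns."""
    ranges = []
    for j in range(len(data[0])):
        column = [row[j] for row in data]
        if isinstance(column[0], str):
            ranges.append(0.0)
        else:
            ranges.append(max(column) - min(column))
    return ranges


def mixed_dissimilarity(a: Sequence[Value], b: Sequence[Value],
                        ranges: Sequence[float]) -> float:
    """ASSUMPTION: Gower-style stand-in, not the thesis's
    neighborhood-counting measure. Categorical attributes contribute
    0 or 1; numeric attributes contribute a range-normalized difference."""
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if isinstance(x, str) or isinstance(y, str):
            total += 0.0 if x == y else 1.0
        elif r > 0:
            total += abs(x - y) / r
    return total / len(a)


def outlier_factors(data: Sequence[Sequence[Value]], k: int) -> List[float]:
    """ASSUMPTION: outlier factor = mean dissimilarity to the k nearest
    neighbors. Brute force, O(n^2) pairwise comparisons."""
    ranges = attribute_ranges(data)
    factors = []
    for i, p in enumerate(data):
        dists = sorted(mixed_dissimilarity(p, q, ranges)
                       for j, q in enumerate(data) if j != i)
        factors.append(sum(dists[:k]) / k)
    return factors


def top_outliers(data: Sequence[Sequence[Value]], k: int, n: int) -> List[int]:
    """Indices of the n points with the largest outlier factors."""
    factors = outlier_factors(data, k)
    order = sorted(range(len(data)), key=lambda i: factors[i], reverse=True)
    return order[:n]
```

For example, calling `top_outliers(data, k=2, n=1)` on a small list of `(float, str)` tuples returns the index of the point farthest from its two nearest neighbors. Because each point's factor depends only on its own neighbor dissimilarities, the outer loop is the natural unit to distribute across map tasks, though how the thesis actually partitions the work on Hadoop is not stated in the abstract.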
Keywords/Search Tags: Outlier detection, Hadoop, Global outlier, Local outlier, Mixed data