
Research And Implementation Of Local Outlier Detection Algorithm On Hadoop

Posted on: 2019-09-10
Degree: Master
Type: Thesis
Country: China
Candidate: Y M Liu
Full Text: PDF
GTID: 2428330566974162
Subject: Computer Science and Technology

Abstract/Summary:
With the rapid development of computer and Internet technology, the world has entered the era of massive data. People have changed from data consumers into data producers, and the volume of data is expanding rapidly; the main problem that enterprises and users face is how to extract valuable information from such massive data. The exploration of the knowledge contained in data has driven the development of data mining, and outlier detection is one of its most active areas. Outlier detection has broad applications, such as network intrusion detection, medical disease monitoring, bank financial fraud detection, and environmental monitoring. At the same time, cloud computing and distributed techniques for processing large-scale data are developing rapidly, and the open-source Hadoop platform makes the storage and processing of massive data more effective and convenient.

Based on research and analysis of current local outlier detection algorithms, this thesis applies density clustering theory in a clustering-based data preprocessing step to filter out points that belong to dense clusters and thereby reduce the data size, and improves the calculation of the outlier factor used by the detection algorithm. On this basis, a distributed local outlier detection algorithm based on density clustering is proposed.

This thesis first studies local outlier detection algorithms, introducing the density-based algorithms LOF, COF and INFLO and analyzing their principles and implementations. Secondly, it studies density-based clustering algorithms and describes in detail the characteristics and computation methods of two different density clustering models. It then discusses the components of the Hadoop ecosystem, with an in-depth study of the storage and read/write principles of the distributed file system, the design principles of the HBase database, the execution architecture and programming model of MapReduce, and the parallel computation and coordination mechanisms of the other components. Based on this research, an improved local outlier detection algorithm is proposed that combines attribute-weighted information entropy with a distributed detection scheme, in order to address the large data scale faced by current detection algorithms, the computing bottlenecks caused by complex data composition, and the limited scalability of data nodes.

Finally, a Hadoop distributed cluster and an HBase database environment are built, and the proposed algorithm is implemented on the cluster. Experiments demonstrate that the distributed outlier detection algorithm based on density clustering can effectively detect outliers in large-scale data. Compared with parallelized versions of the COF and LOF algorithms, it has lower time complexity and relatively higher accuracy, and the data scalability problem can be addressed by adding data nodes to the cluster.
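
The LOF algorithm referenced above scores each point by comparing its local density with the local densities of its k nearest neighbors. The following is a minimal Python sketch of the standard LOF computation, not the improved variant proposed in the thesis; the function and variable names are illustrative, and a small in-memory dataset is assumed.

```python
import numpy as np

def lof_scores(X, k=5):
    """Brute-force Local Outlier Factor for a small in-memory dataset X (n x d)."""
    n = len(X)
    # Pairwise Euclidean distances.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Indices of each point's k nearest neighbors (excluding itself).
    knn = np.argsort(dists, axis=1)[:, 1:k + 1]
    # k-distance of each point = distance to its k-th nearest neighbor.
    k_dist = dists[np.arange(n), knn[:, -1]]
    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist_k(p, o) = max(k-distance(o), d(p, o)).
    lrd = np.empty(n)
    for p in range(n):
        reach = np.maximum(k_dist[knn[p]], dists[p, knn[p]])
        lrd[p] = 1.0 / (reach.mean() + 1e-12)
    # LOF(p): average ratio of the neighbors' lrd to p's own lrd.
    lof = np.array([lrd[knn[p]].mean() / lrd[p] for p in range(n)])
    return lof  # values clearly above 1 indicate local outliers

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)), [[8.0, 8.0]]])  # one obvious outlier
    print(lof_scores(X, k=10)[-1])  # the isolated point gets a large LOF score
```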
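
The preprocessing step described above uses density clustering to discard points lying inside dense clusters so that only a reduced candidate set is scored. The abstract does not name the clustering model, so the sketch below assumes DBSCAN as a representative density clustering algorithm (via scikit-learn); the `eps` and `min_samples` values are illustrative, and the sketch only shows the data-reduction idea, not the thesis's actual preprocessing.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def prefilter_candidates(X, eps=0.5, min_samples=10):
    """Keep only the points DBSCAN labels as noise (label == -1); these form
    the reduced candidate set passed on to the local outlier detector."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    candidate_mask = labels == -1   # noise points are outlier candidates
    return X[candidate_mask], candidate_mask

# Usage: score only the candidates instead of the full dataset.
# candidates, mask = prefilter_candidates(X, eps=0.8, min_samples=15)
# scores = lof_scores(candidates, k=10)
```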
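
The MapReduce programming model mentioned above expresses a computation as a map phase that emits key/value pairs and a reduce phase that aggregates the values sharing a key. As a rough illustration of how records could be partitioned across a cluster for per-partition density work, the following Hadoop Streaming-style mapper and reducer assign each 2-D record to a coarse grid cell and count the records per cell; this is a toy example, not the partitioning scheme actually used in the thesis.

```python
#!/usr/bin/env python3
# mapper.py -- reads "x,y" records from stdin and emits "cell_id<TAB>x,y",
# so that all points falling in the same coarse grid cell reach one reducer.
import sys

CELL = 1.0  # grid cell width (illustrative parameter)

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    x, y = map(float, line.split(","))
    cell_id = f"{int(x // CELL)}_{int(y // CELL)}"
    print(f"{cell_id}\t{x},{y}")
```

```python
#!/usr/bin/env python3
# reducer.py -- receives the mapper output sorted by key and, for each grid
# cell, counts the points assigned to it (a crude stand-in for the per-partition
# density computation a real distributed detector would perform here).
import sys
from itertools import groupby

def parse(stream):
    for line in stream:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value

for cell_id, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
    points = [v for _, v in group]
    print(f"{cell_id}\t{len(points)}")
```

Such scripts would typically be submitted with the Hadoop Streaming jar, e.g. `hadoop jar hadoop-streaming.jar -input <in> -output <out> -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py`.
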
Keywords/Search Tags: big data, density clustering, local outlier detection algorithm, LOF, Hadoop