Research On Hierarchical Clustering Algorithm And Parallelization In Massive Data Environment

Posted on: 2020-02-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Zhang
Full Text: PDF
GTID: 1488306512981599
Subject: Computer application technology
Abstract/Summary:
With the development of Internet of Things (IoT) technology, more and more sensors, mobile terminals, and computers are connected through the network. IoT-based sensors have been applied to all aspects of our lives, such as power systems, transportation systems, building systems, water systems, oil and gas systems, and household appliances. In recent years, owing to the extensive and in-depth application of IoT technology, the demand of massive data processing for computing resources has grown explosively. The emergence of cloud computing and edge computing provides a large amount of computing resources for storing and processing massive data, while big data technology provides effective technical support for massive data analysis. How to apply classic data mining algorithms in a massive data environment, how to make full use of the computing resources provided by cloud computing and edge computing, and how to efficiently process and mine massive data in various fields of social life are the challenges that today's data mining research must address.

This thesis targets classical data mining algorithms commonly used in IoT data processing applications, such as cluster analysis and outlier detection. It takes the implementation of a distributed storage and parallel computing framework as its research core and focuses on the parallelization of classical data mining algorithms in a massive data environment. The thesis proposes solutions for nearest neighbor search, hierarchical clustering, and outlier detection in a massive data environment. The main results and innovations of the work are summarized as follows:

(1) The nearest neighbor search in hierarchical clustering algorithms has high time complexity, so such algorithms cannot process and analyze data in a massive data environment. To address this problem, a fast nearest neighbor search method based on the nearest neighbor boundary is proposed and applied to hierarchical clustering. The method effectively reduces the time complexity and space complexity of the algorithm. The thesis studies the characteristics of nearest neighbor search methods in hierarchical clustering and combines the data segmentation technique with nearest neighbor search to propose the concept of the nearest neighbor boundary (NNB), which can effectively improve the efficiency of searching for nearest neighbors. By studying neighbor similarity, its measurement methods, and the relationships between the metrics, the thesis shows that the data segmentation technique based on the nearest neighbor boundary can reduce the complexity of hierarchical clustering while maintaining the classification accuracy of the algorithm. Finally, the nearest neighbor search method based on the nearest neighbor boundary is applied to the hierarchical clustering algorithm, and the effectiveness of the hierarchical clustering algorithm based on fast nearest neighbor search (NBC) is verified experimentally.

(2) In a massive data environment, a hierarchical clustering algorithm running on a single computer is limited by that computer's performance and cannot process and analyze the data. NBCP is proposed to parallelize the hierarchical clustering algorithm based on fast nearest neighbor search, and it is deployed on the Hadoop platform. The thesis studies the data grouping technique based on the nearest neighbor boundary. The nearest neighbor search work is separated into multiple independent tasks, which forms the nearest neighbor search parallelization scheme. A task equalization strategy for data grouping in the parallelized nearest neighbor search is proposed and proved theoretically. Finally, the nearest neighbor search method based on the nearest neighbor boundary is applied to the MapReduce-based Hadoop distributed storage and computation framework, and the effectiveness of the parallelization scheme for the hierarchical clustering algorithm based on fast nearest neighbor search is verified.

(3) In an edge computing environment, the data storage and processing nodes used for hierarchical clustering have limited computing resources. To address this, a distributed storage and parallel computing framework is proposed in which data are stored in a distributed tree structure and nearest neighbor search is performed directly on the storage nodes. A prototype of a distributed hierarchical clustering (DHC) parallelization algorithm that can effectively process massive data is implemented on this framework. The thesis proposes a distributed storage method based on data segmentation, which stores the massive data for nearest neighbor search in an efficiently distributed manner, and constructs a distributed storage and parallel computing framework based on nearest neighbor search technology. The DHC prototype implemented on this framework stores a massive data set on distributed storage nodes organized in a tree structure and performs parallel computation at each storage node. This effectively improves the performance of the algorithm under resource-constrained conditions and makes it suitable for data processing in edge computing environments. The performance of DHC under different parameters is evaluated experimentally, and the effectiveness of the DHC algorithm is verified.

(4) Because sensor nodes in the Internet of Things have limited capabilities, observations collected from them typically have lower data quality and reliability. Outlier detection for IoT data therefore faces many difficulties and challenges, and conventional outlier detection algorithms often cannot detect abnormal data correctly. The thesis proposes an outlier detection algorithm based on hierarchical clustering (OHC). It is found that the tree diagram (dendrogram) produced by merging nearest neighbors during hierarchical clustering naturally reflects the density relationships between objects. The thesis proposes using the degree to which an object participates in the hierarchical clustering process, called the participation degree, as the measure for identifying outliers, and gives the theoretical basis for applying the participation degree to outlier detection. The proposed OHC algorithm is an unsupervised outlier detection algorithm, which overcomes the shortcomings of some supervised outlier detection algorithms. The design of OHC adopts the data segmentation technique based on the nearest neighbor boundary, which enables the algorithm to process massive data. In the experiments, synthetic data and real data are used to analyze and compare the performance of the algorithm, and the performance and effectiveness of OHC are verified against other outlier detection techniques.
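
Contribution (1) above combines data segmentation with nearest neighbor search so that each query inspects only part of the data. The abstract does not define how the nearest neighbor boundary is constructed, so the Python sketch below is not the thesis's NNB method. It substitutes a plain uniform grid as the segmentation (an assumption), and all names in it (`build_grid`, `nearest_neighbor`, `cell_size`) are hypothetical. It only illustrates the general idea of bounding the search region: cells are visited in rings around the query, and the search stops once no unvisited cell can hold a closer point.

```python
import math
from collections import defaultdict


def build_grid(points, cell_size):
    """Segment 2-D points into square cells of side cell_size (assumed scheme)."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        grid[cell].append(idx)
    return grid


def nearest_neighbor(points, grid, cell_size, query_idx):
    """Return (index, distance) of the nearest other point to points[query_idx]."""
    qx, qy = points[query_idx]
    cx, cy = math.floor(qx / cell_size), math.floor(qy / cell_size)
    # Farthest occupied ring, so the loop always terminates.
    max_ring = max(max(abs(gx - cx), abs(gy - cy)) for gx, gy in grid)
    best_idx, best_dist = None, math.inf
    for ring in range(max_ring + 1):
        # Visit only the cells whose Chebyshev distance from the query cell is `ring`.
        for gx in range(cx - ring, cx + ring + 1):
            for gy in range(cy - ring, cy + ring + 1):
                if max(abs(gx - cx), abs(gy - cy)) != ring:
                    continue
                for idx in grid.get((gx, gy), ()):
                    if idx == query_idx:
                        continue
                    d = math.dist(points[idx], points[query_idx])
                    if d < best_dist:
                        best_idx, best_dist = idx, d
        # Every unvisited point lies at least ring * cell_size away from the query,
        # so the current best cannot be beaten once it is within that bound.
        if best_dist <= ring * cell_size:
            break
    return best_idx, best_dist


def brute_force_nearest(points, query_idx):
    """Reference O(n) scan, used only to check the grid search."""
    best_idx, best_dist = None, math.inf
    for idx, p in enumerate(points):
        if idx != query_idx:
            d = math.dist(p, points[query_idx])
            if d < best_dist:
                best_idx, best_dist = idx, d
    return best_idx, best_dist


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (5.2, 5.1), (9.0, 1.0)]
grid = build_grid(points, cell_size=2.0)
print(nearest_neighbor(points, grid, 2.0, 0))  # (1, 1.0)
```

The choice of `cell_size` trades the number of cells visited against the number of points per cell. The task equalization strategy in contribution (2) addresses a related balancing question for parallel tasks, but the abstract gives no details of it, so it is not modeled here.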
Keywords/Search Tags: nearest neighbor search, hierarchical clustering, outlier detection, MapReduce, distributed storage and computing