
Parallel Research On Data Mining Algorithm Based On YARN And Spark Framework

Posted on: 2017-04-17  Degree: Master  Type: Thesis
Country: China  Candidate: M H Chen  Full Text: PDF
GTID: 2308330482488158  Subject: Computer technology
Abstract/Summary:
With the rapid development of science and technology, the deepening of Internet applications, and the spread of personal computers, tablets, smartphones, smart-home devices and other terminal equipment, the amount of data worldwide is growing rapidly, and we have entered the era of Big Data. Traditional models (the stand-alone model and the traditional parallel computing model) struggle to handle data at this scale, and distributed computing platforms have emerged to provide a new way of processing massive data. Compared with the traditional parallel computing model, a distributed computing platform handles data partitioning, task allocation, parallel processing, fault tolerance and similar functions in its underlying layer, and it is easy to scale, learn, use and deploy. The distributed computing platform offers a simple, abstract parallel programming model. Because users only need to concentrate on the computing task they want to solve and need not be concerned with the details of the parallel implementation, the design of parallel programs is greatly simplified. Applied to parallel algorithm design, the model has high practical value for improving algorithm efficiency. This thesis uses the model to study the parallelization of the DBSCAN clustering algorithm. The results are as follows:
(1) A data sub-grid algorithm based on grid units is proposed. The algorithm divides the data of each partition into rectangular cells whose side length equals the DBSCAN Eps radius, which greatly speeds up the search for the Eps neighborhood of each data object. It is no longer necessary to search all the data of the entire partition to find the Eps neighborhood of a data object; the search range is narrowed to the eight cells adjacent to the data object's cell.
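The grid idea in (1) can be illustrated with a minimal single-machine sketch in Python. It is not the thesis's implementation: the function names `build_grid` and `eps_neighbors` are hypothetical, the data are assumed to be two-dimensional with Euclidean distance, and the sketch assumes that the object's own cell is scanned together with its eight adjacent cells.

```python
from collections import defaultdict
from math import floor, hypot


def build_grid(points, eps):
    """Map each cell index (col, row) to the indices of the points inside it.

    Cells are squares with side length eps (an assumption of this sketch).
    """
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(floor(x / eps), floor(y / eps))].append(i)
    return grid


def eps_neighbors(i, points, grid, eps):
    """Return the sorted indices of all points within eps of points[i].

    Only the point's own cell and its eight adjacent cells are scanned.
    Any point within eps differs by at most eps in each coordinate, so it
    cannot lie outside this 3x3 block of cells.
    """
    x, y = points[i]
    cx, cy = floor(x / eps), floor(y / eps)
    result = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            # grid.get avoids creating empty cells in the defaultdict
            for j in grid.get((cx + dx, cy + dy), ()):
                px, py = points[j]
                if hypot(px - x, py - y) <= eps:
                    result.append(j)
    return sorted(result)
```

In a Spark setting, a function such as `build_grid` would presumably run once per partition, so that each neighborhood query touches at most nine cells rather than the whole partition.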
The experimental results show that the algorithm improves clustering speed, achieves good speedup and scalability, and clusters faster than traditional methods.
(2) A new method for consolidating partition clusters is proposed. While making full use of the advantages of distributed computing platforms, it optimizes the merging of clusters after each data partition has been clustered. The boundary points are clustered again, and the new clustering results for the boundary points are compared with their original clustering results; the clusters are then merged according to the differences. In this way, the clusters of the various partitions can be merged by re-clustering only the boundary points, which greatly improves the speed of merging partition clusters.
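One possible reading of the merge step in (2) is sketched below in Python. The abstract does not specify the comparison rule, so the rule used here is an assumption: whenever two boundary points fall into the same cluster after re-clustering but carried different original labels, those original clusters are treated as one. The function name `merge_clusters`, the label format `(partition_id, local_cluster_id)`, and the use of `None` for noise are all hypothetical.

```python
def merge_clusters(original_label, reclustered_label):
    """Derive a merge map for partition-local clusters from boundary points.

    original_label:    boundary point id -> (partition_id, local_cluster_id),
                       or None if the point was noise in its partition.
    reclustered_label: boundary point id -> cluster id obtained by clustering
                       the boundary points again, or None for noise.
    Returns a dict mapping each original label seen on a boundary point to
    the representative label of its merged cluster.
    """
    parent = {}

    def find(a):
        # Union-find lookup with path halving
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller label as representative for determinism
            parent[max(ra, rb)] = min(ra, rb)

    first_seen = {}  # re-clustered id -> first original label met in it
    for point_id, new in reclustered_label.items():
        old = original_label.get(point_id)
        if new is None or old is None:
            continue  # noise carries no merge information in this sketch
        find(old)  # register the label even if it is never merged
        if new in first_seen:
            union(first_seen[new], old)
        else:
            first_seen[new] = old

    return {label: find(label) for label in list(parent)}


# Usage: points 1 and 2 come from different partitions but are re-clustered
# together, so their original clusters receive the same representative.
original = {1: (0, 0), 2: (1, 0), 3: (1, 1), 4: (0, 1), 5: None}
reclustered = {1: "a", 2: "a", 3: "b", 4: None, 5: "b"}
merge_map = merge_clusters(original, reclustered)
```

Only the boundary points enter this step; interior points keep their partition-local labels and are relabeled through the resulting map.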
Keywords/Search Tags:Distributed Computing, DBSCAN, Spark, YARN, Tachyon