Parallel Research On Data Mining Algorithm Based On YARN And Spark Framework

Posted on:2017-04-17

Degree:Master

Type:Thesis

Country:China

Candidate:M H Chen

GTID:2308330482488158

Subject:Computer technology

Abstract/Summary:

With the rapid development of science and technology, the deepening of Internet applications, and the spread of personal computers, tablets, smartphones, smart-home devices and other terminal equipment, the volume of data worldwide is growing rapidly, and we have entered the era of Big Data. Traditional models (the stand-alone model and the traditional parallel computing model) struggle to handle data at this scale, and distributed computing platforms have emerged to provide a new approach to massive data processing. Compared with the traditional parallel computing model, a distributed computing platform handles data partitioning, task allocation, parallel processing, fault tolerance and similar functions in its underlying layer, and it is easy to scale, learn, use and deploy. It offers a simple, abstract parallel programming model: users only need to concentrate on the parallel computing task they want to solve and need not be concerned with the details of the parallel implementation, which greatly simplifies the design of parallel programs. Applied to parallel algorithm design, the model has high practical value for improving algorithm efficiency. This paper uses the model to study the parallelization of the DBSCAN clustering algorithm. The results are as follows:

(1) A data sub-grid algorithm based on grid cells is proposed. The algorithm divides the data set of each partition into square cells whose side length is the DBSCAN Eps radius, which greatly speeds up finding the Eps neighborhood of a data object. The Eps neighborhood of an object no longer has to be searched for across the whole data set of the partition; the search is narrowed to the object's own cell and its eight adjacent cells. The experimental results show that the algorithm improves clustering speed, achieves good speedup and scaleup, and clusters faster than traditional methods.

(2) A new method for merging partition clusters is proposed. While making full use of the advantages of the distributed computing platform, it optimizes the merging of clusters after each data partition has been clustered. The boundary points are clustered again, and clusters are merged by comparing the new clustering results for the boundary points with their original clustering results. In this way the clusters of the various partitions can be merged by re-clustering only the boundary points, which greatly improves the speed of merging partition clusters.
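
The grid idea in contribution (1) can be illustrated with a minimal single-machine sketch in Python. It is not the thesis's implementation: it assumes two-dimensional points and Euclidean distance, it leaves out the Spark partitioning entirely, and the function names build_grid and eps_neighbors are hypothetical. Because every cell has side length Eps, any point within Eps of a query point must lie in the query point's own cell or in one of the eight cells around it, so only those nine cells are scanned.

```python
from collections import defaultdict
from math import floor


def build_grid(points, eps):
    """Map each square cell of side eps to the indices of the points inside it.

    Hypothetical helper for illustration; points is a list of (x, y) tuples.
    """
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(floor(x / eps), floor(y / eps))].append(idx)
    return grid


def eps_neighbors(points, grid, eps, idx):
    """Return the sorted indices of all points within eps of points[idx].

    The result includes idx itself. Only the cell containing the point and
    its eight adjacent cells are examined, not the whole data set.
    """
    x, y = points[idx]
    cx, cy = floor(x / eps), floor(y / eps)
    eps_sq = eps * eps
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            # .get avoids creating empty cells in the defaultdict
            for j in grid.get((cx + dx, cy + dy), ()):
                px, py = points[j]
                if (px - x) ** 2 + (py - y) ** 2 <= eps_sq:
                    found.append(j)
    return sorted(found)
```

A DBSCAN core-point test would then compare the length of the list returned by eps_neighbors against MinPts. With points spread roughly evenly, each query touches only a small, roughly constant number of points instead of the whole partition, which is the source of the speed-up the abstract describes.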

Keywords/Search Tags:

Distributed Computing, DBSCAN, Spark, YARN, Tachyon

Related items

1	Research On Parallelization Of Data Mining Algorithm Based On Distributed Platforms Spark And YARN
2	A Research About DBSCAN Text Clustering Based On Spark Platform
3	The Design And Implementation Of Log Analysis System In Cloud Computing Environment
4	Research On Improved DBSCAN Algorithm Based On Spark Platform
5	KDSG-DBSCAN:A High Performance DBSCAN Algorithm Based On K-D Tree And Spark GraphX
6	Research On Adaptive Parameter Of DBSCAN Algorithm And Its Application On Spark Platform
7	Study Of MPI/GPU Parallel Computing Processing Mechanism On Spark
8	Design And Implementation GPU Training Platform Based On YARN
9	The Design And Implementation Of Data Mining System On Yarn
10	The Design And Implementation Of Hot Topic Detection System Of Tweets Based On Spark On Yarn