Clustering Algorithm Based On The Background Of Big Data

Posted on:2017-10-08

Degree:Master

Type:Thesis

Country:China

Candidate:B Zhang

GTID:2348330521450528

Subject:Software engineering

Abstract/Summary:

With the rapid development of computer technology and the Internet, the data generated by modern society is growing at an unprecedented rate, and managing and using data at this scale has become an unavoidable practical problem. The large-scale data processing capabilities of cloud computing make it possible to analyze and extract the information and knowledge hidden in big data. Cluster analysis, one of the most commonly used static data analysis methods, is widely applied in pattern recognition, machine learning, data mining, and other fields, and with the advent of the big data era its use on large datasets has become common. As a distributed computing framework for cloud computing environments, Hadoop has earned an international reputation for low cost and high efficiency and has become a major focus of research and application. Many clustering algorithms, such as K-means and spectral clustering, have been implemented on the Hadoop platform. Building on these achievements and targeting their remaining problems, the main work of this thesis is as follows:

(1) This thesis introduces the Hadoop ecosystem in detail, especially the HDFS distributed file system and the MapReduce distributed computing framework. It also discusses in depth the mechanisms of its multi-job chain and the Partition and Combine mechanisms of the shuffle stage.

(2) This thesis proposes a parallel optimization algorithm based on hashing. The massive, high-dimensional data is first mapped to a compressed identifier space; clustering relations are then mined and the initial cluster centers are selected. These steps avoid the sensitivity of K-means to randomly selected initial cluster centers and reduce its number of iterations. Finally, the Partition and Combine mechanisms are applied to optimize the parallelization of the algorithm, increasing its degree of parallelism and its execution efficiency. Experiments show that the proposed algorithm improves clustering accuracy and stability and also has good processing performance.

(3) This thesis proposes an efficient parallel clustering algorithm named PAClustering on the Hadoop platform. Based on the data distribution, a weight-based scheme first partitions the dataset into a number of data blocks; each data block is then divided into many groups, and the compact data in each group is gathered into a vector. Finally, an arborescence merge is applied to perform the clustering. The new algorithm improves clustering accuracy and avoids iterative operations in the clustering process. Experiments on datasets of different sizes show that this algorithm not only achieves higher clustering accuracy and stability but also has good processing performance.
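To make the role of the Combine mechanism in parallel K-means concrete, the following is a minimal single-machine sketch of one K-means iteration written in map/combine/reduce style. It is not the thesis's implementation and does not use the Hadoop API: the function names (map_phase, combine_phase, reduce_phase, kmeans_iteration) and the representation of input splits as Python lists are assumptions made for illustration only. The point it shows is that each mapper's output can be collapsed into one (vector sum, count) pair per cluster before the shuffle, so far less data reaches the reducer.

```python
def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def map_phase(split, centers):
    """Map: emit (cluster_id, (vector, 1)) for every point in one input split."""
    return [(nearest(p, centers), (list(p), 1)) for p in split]


def combine_phase(pairs):
    """Combine: collapse (cluster_id, (vector, count)) pairs into one
    (vector_sum, total_count) pair per cluster id."""
    partial = {}
    for cid, (vec, n) in pairs:
        if cid in partial:
            acc, count = partial[cid]
            partial[cid] = ([a + v for a, v in zip(acc, vec)], count + n)
        else:
            partial[cid] = (list(vec), n)
    return list(partial.items())


def reduce_phase(combined, centers):
    """Reduce: merge the partial sums from all mappers and average them.
    A cluster that received no points keeps its previous center."""
    new_centers = [list(c) for c in centers]
    for cid, (acc, count) in combine_phase(combined):
        new_centers[cid] = [a / count for a in acc]
    return new_centers


def kmeans_iteration(splits, centers):
    """One K-means iteration: map and combine per split, then one reduce."""
    shuffled = []
    for split in splits:  # on a cluster, each split would go to its own mapper
        shuffled.extend(combine_phase(map_phase(split, centers)))
    return reduce_phase(shuffled, centers)


if __name__ == "__main__":
    # Hypothetical toy data: two input splits of 2-D points.
    splits = [[(0, 0), (0, 2)], [(10, 10), (10, 12), (0, 1)]]
    print(kmeans_iteration(splits, [[0.0, 0.0], [10.0, 10.0]]))
```

In this sketch the combiner and the reducer share the same merge logic because summing vectors and counts is associative and commutative, which is the property a combine step needs in order to be safe. The hash-based selection of initial centers described in (2) is not reproduced here, since the abstract does not specify the hash function used.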

Keywords/Search Tags:

big data, cloud computing, Hadoop, clustering, Hash, K-means, Microcluster, Arborescence merge

Related items

1	Research On K-Means Clustering Algorithm Based On Hadoop Cloud Computing Platform
2	Research On Parallel Clustering Algorithm Based On Hadoop Cloud Computing Platform
3	Research And Implementation Of Disaster Big Data Management Methods Based On Cloud Computing
4	Research Of Clustering Mining Algorithm Oriented Big Data
5	Research On Data Mining Technology Of Internet Of Things Based On Cloud Computing
6	K-Means Algorithm Design And Implementation Based On Hadoop And Mahout
7	Cloud Computing-based Integrated Operation Management Platform Research
8	Research On Parallelization Of Text Clustering Based On Hadoop Cloud Computing Platform
9	The Research Of Clustering Algorithm Based On Hadoop Cloud Computing Platform
10	A Research And Implementation With Improved K-Means Clustering Algorithm To Image Retrieval System Based On Hadoop Platform