Clustering Algorithm Based On The Background Of Big Data

Posted on: 2017-10-08
Degree: Master
Type: Thesis
Country: China
Candidate: B Zhang
GTID: 2348330521450528
Subject: Software engineering
Abstract/Summary:
With the rapid development of computer technology and the Internet, the volume of data produced by modern society is growing at an unprecedented rate, and managing and exploiting large-scale data has become a practical necessity. The large-scale data processing capabilities of cloud computing make it possible to analyze and extract the information and knowledge hidden in big data. Cluster analysis, one of the most commonly used static data analysis methods, is widely applied in pattern recognition, machine learning, data mining, and other fields, and with the advent of the big data era its use on large datasets has become commonplace. As a distributed computing framework for cloud computing environments, Hadoop has earned an international reputation for low cost and high efficiency and has become a major focus of research and application. Many clustering algorithms, such as K-means and spectral clustering, have already been implemented on the Hadoop platform. Building on these results and targeting their remaining problems, the main work of this thesis is as follows:

(1) The thesis introduces the Hadoop ecosystem in detail, especially the HDFS distributed file system and the MapReduce distributed computing framework. It also discusses in depth the multi-job chaining mechanism of MapReduce and the Partition, Combine, and related mechanisms of the shuffle stage.

(2) The thesis proposes a parallel optimization algorithm based on hashing. Massive, high-dimensional data are first mapped to a compressed identifier space; the clustering relations are then mined in that space and the initial cluster centers are selected. These steps avoid the sensitivity caused by randomly selected initial centers and reduce the number of K-means iterations. Finally, the Partition and Combine mechanisms are applied to optimize the parallel implementation, increasing its degree of parallelism and its execution efficiency. Experiments show that the proposed algorithm improves clustering accuracy and stability and also has good processing performance.

(3) The thesis proposes an efficient parallel clustering algorithm, named PAClusteringon, on the Hadoop platform. Based on the data distribution, a weight-based scheme first partitions the dataset into a number of data blocks; each block is then divided into many groups, and the compact data in each group are gathered into a vector. Finally, arborescence merging is applied to perform the clustering. The new algorithm improves clustering accuracy and avoids iterative operations in the clustering process. Experiments on datasets of different sizes show that the algorithm not only achieves higher clustering accuracy and stability but also has good processing performance.
Keywords/Search Tags:big data, cloud computing, Hadoop, clustering, Hash, K-means, Microcluster, Arborescence merge
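
The abstract does not say which hash function is used or how the initial centers are chosen from the identifier space, so the following is only a minimal sketch of the general idea in item (2), under stated assumptions. It assumes a simple grid hash (each coordinate divided by a cell width and floored) as the "compressed identifier space", takes the means of the most populated buckets as initial centers, and then performs one K-means update in a map/combine/reduce style. All names (`grid_hash`, `seed_centers`, `map_combine`, `reduce_centers`) and the `cell` parameter are hypothetical, and the code runs in plain Python rather than on Hadoop.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Partial = Dict[int, Tuple[Point, int]]  # center index -> (coordinate sums, count)


def mean(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def grid_hash(point: Point, cell: float) -> Tuple[int, ...]:
    """Assumed stand-in hash: map a point to the identifier of its grid cell."""
    return tuple(int(x // cell) for x in point)


def seed_centers(points: List[Point], k: int, cell: float) -> List[Point]:
    """Use the means of the k most populated hash buckets as initial centers."""
    buckets: Dict[Tuple[int, ...], List[Point]] = defaultdict(list)
    for p in points:
        buckets[grid_hash(p, cell)].append(p)
    densest = sorted(buckets.values(), key=len, reverse=True)[:k]
    return [mean(b) for b in densest]


def nearest(point: Point, centers: List[Point]) -> int:
    """Index of the center with the smallest squared Euclidean distance."""
    return min(
        range(len(centers)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centers[i])),
    )


def map_combine(split: List[Point], centers: List[Point]) -> Partial:
    """Mapper plus combiner for one input split.

    Instead of emitting every (center, point) pair, the split is pre-aggregated
    into one (sum, count) record per center, which is what a combiner is for.
    """
    partial: Partial = {}
    for p in split:
        i = nearest(p, centers)
        sums, count = partial.get(i, ((0.0,) * len(p), 0))
        partial[i] = (tuple(a + b for a, b in zip(sums, p)), count + 1)
    return partial


def reduce_centers(partials: List[Partial], centers: List[Point]) -> List[Point]:
    """Reducer: merge the per-split records and recompute each center."""
    totals: Partial = {}
    for partial in partials:
        for i, (sums, count) in partial.items():
            t_sums, t_count = totals.get(i, ((0.0,) * len(sums), 0))
            totals[i] = (tuple(a + b for a, b in zip(t_sums, sums)), t_count + count)
    # A center that received no points keeps its previous position.
    return [
        tuple(x / totals[i][1] for x in totals[i][0]) if i in totals else c
        for i, c in enumerate(centers)
    ]
```

In a real MapReduce job the per-split aggregation would live in a Combiner class and a Partitioner would route each center index to a reducer; here the list of splits passed to `reduce_centers` plays that role. Whether the thesis seeds centers from bucket density in this way is not stated in the abstract.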