
Research On Spectral Clustering Algorithm Based On Hadoop Platform

Posted on: 2015-02-18
Degree: Master
Type: Thesis
Country: China
Candidate: B G Yang
GTID: 2298330422490191
Subject: Computer application technology
Abstract/Summary:
Clustering is one of the important techniques in the field of data mining. The main goal of data mining is to extract useful and valuable information from large amounts of data, and data mining technology has been applied in many fields, such as industry and commerce. However, with the rapid development of these fields, the amount of data generated keeps growing, and traditional data processing technology can no longer meet the demands of this growth in terms of either time or hardware. Handling massive amounts of data effectively has become an urgent priority, so applying parallel computing methods to large data sets has become a hot research topic.

Traditional parallel computing models and methods are mainly based on time parallelism and space parallelism. The former places much higher demands on data processing, and its parallel design process is more cumbersome, so it is mainly used in scientific computing. The latter places much higher demands on parallel hardware, which is expensive and quickly becomes obsolete, causing great waste. With the rapid growth in the amount of data, there is an urgent need for appropriate technology to solve this problem.

Google proposed the MapReduce computing model. Because it encapsulates the complex underlying programming process, users do not need to write complex data partitioning, task scheduling, or parallel processing procedures; they only need to focus on the problem they want to solve. As a result, it received widespread attention as soon as it was launched. Its only drawback is that it is a "closed source" framework. Apache Hadoop provided an open-source implementation of the MapReduce programming model in 2008.
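The division of labor in the MapReduce model described above can be illustrated with a minimal, single-process sketch. This is not Hadoop code: `map_reduce`, `degree_mapper`, `degree_reducer`, and the sample edge list are hypothetical names and data chosen for illustration. The user supplies only a mapper and a reducer, while grouping by key (which the framework would do across machines) is simulated in memory. The example computes node degrees of a small similarity graph.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to every record, group emitted values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # stands in for the framework's shuffle step
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical input: edges (i, j, similarity) of an undirected similarity graph.
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 0.5)]


def degree_mapper(edge):
    """Emit the edge weight once for each endpoint."""
    i, j, w = edge
    yield i, w
    yield j, w


def degree_reducer(node, weights):
    """Sum all weights incident to one node."""
    return sum(weights)


degrees = map_reduce(edges, degree_mapper, degree_reducer)
print(degrees)
```

The mapper and reducer contain only problem-specific logic; partitioning, scheduling, and grouping live entirely in `map_reduce`, which is the part a framework such as Hadoop would take over.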
With the increasing amount of data in recent years, the Hadoop platform has come into wide use. Based on a thorough study of the Hadoop platform, this thesis builds a Hadoop cloud computing test platform on Linux and, through the study of clustering algorithms, proposes a scheme for parallelizing the spectral clustering algorithm on the Hadoop platform. Compared with conventional clustering algorithms, spectral clustering is more suitable for processing large data, and it does not get trapped in a local optimum when facing high-dimensional and irregular data. By analyzing the traditional spectral clustering procedure, we identify the part that can be parallelized, namely the computation of the eigenvalues and eigenvectors of the Laplacian matrix, and combine it with the cloud computing platform to realize parallel processing. Building on a detailed study of the MapReduce programming framework, we design the data partitioning and parallel tasks for the spectral clustering algorithm. Experiments on Wikipedia data and synthetic data show that the parallelized spectral clustering algorithm performs well on the Hadoop platform and greatly reduces processing time compared with a single machine. It also performs well in terms of speedup and data scalability, and has a clear advantage in handling large volumes of data.
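To make the eigenvector step concrete, the following is a rough sketch, not the thesis's implementation. It assumes an unnormalized Laplacian L = D - W, a shifted power iteration, and a two-way split by the sign of the second eigenvector (full spectral clustering would typically use several eigenvectors followed by k-means). The matrix-vector product is written as a map phase over edges and a reduce phase per row, which is the kind of step that could be distributed. All function names and the sample graph are hypothetical.

```python
import math
from collections import defaultdict


def laplacian_matvec(edges, v):
    """Compute L @ v for L = D - W in map/reduce style over the edge list."""
    # Map phase: each edge emits one partial term for each of its endpoints.
    emitted = []
    for i, j, w in edges:
        emitted.append((i, w * (v[i] - v[j])))
        emitted.append((j, w * (v[j] - v[i])))
    # Shuffle and reduce phase: sum the partial terms for each row index.
    sums = defaultdict(float)
    for row, term in emitted:
        sums[row] += term
    return [sums[i] for i in range(len(v))]


def fiedler_vector(edges, n, iterations=200):
    """Approximate the eigenvector of the second-smallest eigenvalue of L."""
    degree = [0.0] * n
    for i, j, w in edges:
        degree[i] += w
        degree[j] += w
    shift = 2.0 * max(degree)  # upper bound on the eigenvalues of L (Gershgorin)
    v = [float(i) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        mean = sum(v) / n
        v = [x - mean for x in v]  # remove the constant eigenvector component
        lv = laplacian_matvec(edges, v)
        v = [shift * x - y for x, y in zip(v, lv)]  # multiply by (shift * I - L)
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v


# Hypothetical graph: two triangles (0,1,2) and (3,4,5) joined by one weak edge.
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
         (2, 3, 0.1)]
v = fiedler_vector(edges, 6)
labels = [1 if x > 0 else 0 for x in v]
print(labels)
```

Only `laplacian_matvec` touches the graph data, so in a distributed setting that function is the natural candidate for a MapReduce job, while the small vector updates could stay on a single driver node.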
Keywords/Search Tags: Data Mining, Spectral Clustering, Hadoop Platform, Parallel Computing