Research And Application Of Hadoop Distributed Clustering Mining Method Based On Virtual Machine

Posted on: 2016-10-17  Degree: Master  Type: Thesis
Country: China  Candidate: D D Shang  Full Text: PDF
GTID: 2298330467988291  Subject: Software engineering
Abstract/Summary:
With the development of the Internet, the large amount of data generated by various network applications has brought us into the age of big data. Because traditional data mining methods cannot cope with explosively growing data, researchers continue to study new methods for extracting valuable information from massive data sets. As an open source distributed framework, Hadoop has been adopted in more and more business applications, and its commercial value keeps growing as it develops. With the MapReduce programming model, users can conveniently develop their own parallel data processing algorithms to process gigabyte-scale data instead of handling only small data sets on a single machine. This provides a good way to extend clustering methods that are inherently parallel. To improve the stability of Hadoop clusters and reduce the impact of large numbers of node failures on computing performance, this paper proposes combining Hadoop with virtual machines, so that the Hadoop distributed system is built from virtual machine nodes. Because virtual machines can be centrally managed and quickly deployed, the system administrator can start a new node to continue mining tasks when a virtual machine node fails, and can adjust the number of virtual machine nodes to the current task load, which avoids wasting computing resources while still completing the data mining task as quickly as possible.

This paper surveys a range of data mining algorithms and analyzes the Hadoop distributed framework. To achieve accurate and efficient computation in a distributed system, the k-means algorithm is improved. A weight is assigned to each feature attribute to improve the accuracy of the clustering results, and a new algorithm for optimizing the initial cluster centers selects centers over the whole data set so that the clustering results are stable. After showing that the improved clustering algorithm is effective, the paper proposes a method for parallelizing it. Finally, the parallel MapReduce program is deployed on a Hadoop distributed system whose compute nodes are VMware virtual machines, where it performs a clustering analysis of campus network behavior and yields suggestions on charging campus network users. This also provides a reference for universities deploying big data applications for data analysis on virtual platforms, now or in the future.
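The abstract does not give the weighting scheme, the center-selection algorithm, or the MapReduce job layout, so the following is only a minimal single-process sketch of how a feature-weighted k-means iteration might be split into a map step and a reduce step. The function names, the weighted Euclidean distance, and the caller-supplied weights and initial centers are all assumptions made for illustration, not the thesis's method.

```python
from collections import defaultdict
from math import sqrt


def weighted_distance(point, center, weights):
    # Assumed metric: Euclidean distance with one weight per feature.
    return sqrt(sum(w * (p - c) ** 2 for p, c, w in zip(point, center, weights)))


def map_phase(points, centers, weights):
    # Map: emit (index of the nearest center, point) for every input point.
    for point in points:
        nearest = min(
            range(len(centers)),
            key=lambda i: weighted_distance(point, centers[i], weights),
        )
        yield nearest, point


def reduce_phase(pairs, centers):
    # Reduce: average the points grouped under each center index.
    # A center that received no points is kept unchanged.
    groups = defaultdict(list)
    for index, point in pairs:
        groups[index].append(point)
    new_centers = []
    for i, old in enumerate(centers):
        members = groups.get(i)
        if not members:
            new_centers.append(tuple(old))
            continue
        new_centers.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centers


def kmeans(points, centers, weights, max_iter=20, tol=1e-9):
    # Repeat map + reduce until no center moves by more than tol.
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centers, weights), centers)
        shift = max(weighted_distance(a, b, weights) for a, b in zip(updated, centers))
        centers = updated
        if shift <= tol:
            break
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
    print(kmeans(data, [(1.0, 1.0), (9.0, 1.0)], (1.0, 1.0)))
```

On a real Hadoop cluster the map and reduce functions would run as separate tasks over input splits, with the current centers distributed to every mapper; here both steps run in one process only to show the data flow.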
Keywords/Search Tags: data mining, K-means algorithm optimization, Hadoop, MapReduce, virtualization