Research And Application Of Hadoop Distributed Clustering Mining Method Based On Virtual Machine

Posted on:2016-10-17

Degree:Master

Type:Thesis

Country:China

Candidate:D D Shang

GTID:2298330467988291

Subject:Software engineering

Abstract/Summary:

With the development of the Internet, the large amount of data generated by various network applications has brought us into the age of big data. Because traditional data mining methods cannot cope with explosively growing data, researchers continue to study new methods for extracting valuable information from massive data sets. As an open-source distributed framework, Hadoop has been adopted by more and more business applications and, as it continues to grow and develop, carries increasing commercial value. With the MapReduce programming model, users can conveniently develop their own parallel data processing algorithms to process data at the gigabyte scale, instead of handling only small data sets in a single-machine environment. This provides a good way to extend clustering methods that are inherently parallel. To improve the stability of Hadoop clusters and reduce the performance impact of large numbers of node failures, this thesis proposes combining Hadoop with virtual machines, so that the Hadoop distributed system is built from virtual machine nodes. Because virtual machines can be centrally managed and quickly deployed, the system administrator can start a new node to continue mining tasks when a virtual machine node fails, and can adjust the number of virtual machine nodes according to the dynamic task load, avoiding wasted computing resources while completing the data mining task as quickly as possible.

This thesis studies many data mining algorithms and analyzes the distributed Hadoop framework. To achieve accurate and efficient computation in a distributed system, the k-means algorithm is improved: attribute weights are added to each feature to improve the accuracy of the clustering results, and a new algorithm for optimizing the initial cluster centers is used to select cluster centers from the whole data set so that the clustering results are stable. After showing that the improved clustering algorithm is effective, the thesis proposes a method for parallelizing it. Finally, the parallel MapReduce program is deployed on a Hadoop distributed system built from compute nodes running on VMware virtual machines, where it performs clustering analysis of campus network behavior and yields suggestions on charging campus network users. This also provides a reference for universities deploying big data applications for data analysis on virtual platforms, now or in the future.
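
The abstract does not give the thesis's weighting scheme, its initial-center selection algorithm, or its MapReduce code, so the following is only a minimal in-process sketch of the general idea: k-means with a per-feature weight in the distance, organized as a map step (assign each point to its nearest center) and a reduce step (average the points of each cluster). All function names, the weighted Euclidean distance, and the caller-supplied initial centers are assumptions made for illustration, not the author's method.

```python
from collections import defaultdict
from math import sqrt


def weighted_distance(point, center, weights):
    """Euclidean distance with one weight per feature (assumed form)."""
    return sqrt(sum(w * (p - c) ** 2 for p, c, w in zip(point, center, weights)))


def map_assign(point, centers, weights):
    """Map step: emit (index of the nearest center, point)."""
    nearest = min(
        range(len(centers)),
        key=lambda i: weighted_distance(point, centers[i], weights),
    )
    return nearest, point


def reduce_update(points):
    """Reduce step: the new center is the mean of the assigned points."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def kmeans_iteration(data, centers, weights):
    """One map/shuffle/reduce round, simulated in a single process."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for point in data:
        key, value = map_assign(point, centers, weights)
        groups[key].append(value)
    # A center that received no points is kept unchanged.
    return [
        reduce_update(groups[i]) if groups[i] else tuple(centers[i])
        for i in range(len(centers))
    ]


def weighted_kmeans(data, centers, weights, max_iter=100, tol=1e-9):
    """Repeat rounds until no center moves more than tol, or max_iter is hit."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        new_centers = kmeans_iteration(data, centers, weights)
        shift = max(
            weighted_distance(old, new, weights)
            for old, new in zip(centers, new_centers)
        )
        centers = new_centers
        if shift <= tol:
            break
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 0), (10, 2)]
    print(weighted_kmeans(data, [(0, 0), (10, 0)], (1, 1)))
```

In an actual Hadoop job the map and reduce steps would run on separate nodes, with the current centers distributed to every mapper before each round; here the shuffle is simulated with a dictionary so the sketch stays runnable on one machine. Setting a feature's weight to 0 removes it from the distance entirely, which shows how the weights change cluster assignment.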

Keywords/Search Tags:

data mining, K-means algorithm optimization, Hadoop, MapReduce, virtualization

Related items

1	Research On Algorithm Of Data Mining Based On Hadoop
2	Research On Spatial Data Mining Based On Hadoop
3	Research And Implementation Of Data Mining Algorithms Based On Cloud Platform
4	The Research And Application Of Security Log Clustering Mining Algorithm Based On Hadoop Platform
5	Research On Website Structure Optimization Technology Based On Web Usage Mining
6	Research On IPTV QOS Log Analysis Method
7	Research Of Frequent Itemsets Mining Algorithm Based On MapReduce Calculation Model
8	Based On Hadoop Data Mining Algorithm Analysis And Research
9	Research And Application Of Data Mining Algorithms Using Mapreduce
10	The Mapreduce Model In The Hadoop Implementation Of Performance Analysis And Optimization Improvements