The Research And Design Of Distributed Data Mining System Based On Hadoop

Posted on: 2013-05-31
Degree: Master
Type: Thesis
Country: China
Candidate: G Y Wu
GTID: 2268330392970931
Subject: Software engineering
Abstract/Summary:
With the rapid development of Web 2.0, cloud computing, and networking concepts and technologies, the information age increasingly shows the characteristics of "big data". To extract value from large-scale data, data mining technology has received growing attention in commercial, military, economic, and academic fields. At the same time, the huge scale of the data is a major challenge to traditional data mining technology. Combining data mining with cloud computing is becoming an industry trend: it relies on the robust processing power provided by cloud computing and other distributed computing platforms, and it continues to show strong advantages and potential.

Distributed systems, typified by Hadoop, are becoming a necessary part of large-scale data mining systems. This thesis is therefore a practical study of data mining tasks on the Hadoop distributed system. The main task is to build a distributed cluster computing environment using Hadoop and to implement a data mining task in that environment. We study the Hadoop system architecture to gain an in-depth understanding of the Hadoop Distributed File System (HDFS) and the MapReduce parallel programming model. We study the principles of data mining, implement traditional data mining algorithms using the MapReduce programming model, and analyze the efficiency and scalability of these implementations on the Hadoop platform. We select data clustering as a representative task and the K-means clustering algorithm for in-depth study: we work through its principle, write a MapReduce version, and test and verify its effect on the Hadoop platform. Comparative tests with different cluster sizes and data scales show that the Hadoop distributed system achieves good speedup and efficiency in data mining tasks, and the analysis of how computing power scales also shows its great potential.
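The abstract does not reproduce the MapReduce version of K-means, so the following is only a minimal sketch of how one iteration is commonly split into a map step and a reduce step, simulated in plain Python rather than run on Hadoop. All function names (`nearest`, `map_phase`, `reduce_phase`, `kmeans`) and the stopping parameters are assumptions made for illustration and are not taken from the thesis.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map step: emit (cluster_index, (point, 1)) for every input point."""
    for point in points:
        yield nearest(point, centroids), (point, 1)


def reduce_phase(pairs, centroids):
    """Reduce step: sum the points of each cluster and average them into new centroids."""
    sums = {}
    counts = defaultdict(int)
    for idx, (point, n) in pairs:
        if idx in sums:
            sums[idx] = [s + p for s, p in zip(sums[idx], point)]
        else:
            sums[idx] = list(point)
        counts[idx] += n
    # Assumption: a cluster that received no points keeps its previous centroid.
    return [
        tuple(s / counts[i] for s in sums[i]) if i in sums else tuple(centroids[i])
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat map + reduce until no centroid moves more than tol (squared distance)."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(
            sum((a - b) ** 2 for a, b in zip(old, new))
            for old, new in zip(centroids, updated)
        )
        centroids = updated
        if shift <= tol:
            break
    return centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(data, [(0, 0), (10, 10)]))
```

On a real cluster, each iteration would typically be one MapReduce job: the current centroids are distributed to every mapper, the framework groups the emitted pairs by cluster index, and a driver program resubmits the job until the centroids stop moving. Emitting a count alongside each point is what would allow partial sums to be combined before the reduce step.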
Keywords/Search Tags: Distributed Computing, Data Mining, Hadoop, K-means