
Research On Parallel Data Mining Algorithm Based On Hadoop

Posted on: 2017-05-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Zhang
GTID: 2308330485989506
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of scientific research, communications technology and information technology, data sets have grown from gigabytes to terabytes and may reach zettabytes in the future. Cloud computing, with its powerful and reliable computing capacity, has given data mining technology new vitality. This paper uses the Hadoop distributed computing platform and its two core technologies, MapReduce and HDFS, to parallelize classification and clustering algorithms. Experiments show that the parallelized classification and clustering algorithms achieve good speedup and scalability in a distributed computing environment. The main contents are as follows:
This paper introduces the open-source distributed computing platform Hadoop and its two core technologies, MapReduce and HDFS, including their operating mechanisms and implementation principles. It presents the concept of data mining and describes classification and clustering algorithms. Based on the characteristics of existing data mining technology, it also analyses the development trend of data mining.
Based on Hadoop theory, this paper designs a highly reliable Hadoop platform. Because Hadoop versions prior to 1.0.0 lack security authentication, it introduces a Kerberos security policy. To address the single point of failure of the HDFS NameNode and the MapReduce JobTracker, it uses DRBD mirrored block device storage technology, and eventually builds a highly reliable and secure Hadoop environment.
The paper presents the main ideas and implementation code of the K-Means clustering algorithm on the Hadoop platform. Several groups of experiments show that the clustering algorithm on the cloud computing platform has better scalability and higher efficiency.
The paper describes the main ideas and implementation code of the Naive Bayesian classification algorithm on the Hadoop platform. Several experiments show that the classification algorithm on the cloud computing platform has high scalability.
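The thesis's own K-Means code is not reproduced in this abstract. As a rough illustration of how one K-Means iteration is commonly split into map and reduce steps, the following is a minimal single-process Python sketch. It is an assumption about the general approach, not the author's implementation: the function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`, `kmeans`) are hypothetical, and the in-memory grouping stands in for the shuffle phase that Hadoop would perform across nodes.

```python
from collections import defaultdict
from math import dist


def kmeans_map(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def kmeans_reduce(points):
    """Reduce step: average all points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One MapReduce-style pass: map, group by key (shuffle), reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centroids)
        groups[key].append(value)
    # A centroid with no assigned points keeps its previous position.
    return [
        kmeans_reduce(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=20, tol=1e-6):
    """Repeat iterations until no centroid moves more than tol."""
    for _ in range(max_iter):
        new_centroids = kmeans_iteration(points, centroids)
        if all(dist(a, b) <= tol for a, b in zip(centroids, new_centroids)):
            return new_centroids
        centroids = new_centroids
    return centroids
```

For example, `kmeans([(0, 0), (0, 2), (10, 10), (10, 12)], [(0, 0), (10, 10)])` groups the first two and last two points together. In a real Hadoop job each iteration would typically be a separate MapReduce job, with the current centroids distributed to every mapper.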
Keywords/Search Tags:Hadoop, Data Mining, Classification and clustering algorithm, HDFS