Research On Parallel Data Mining Algorithm Based On Hadoop

Posted on:2017-05-25

Degree:Master

Type:Thesis

Country:China

Candidate:Y F Zhang

Full Text:PDF

GTID:2308330485989506

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of scientific research, communications technology and information technology, data sets have grown from gigabytes to terabytes and may reach zettabytes in the future. Cloud computing, with its strong and reliable computing power, brings new vitality to data mining technology. This paper parallelizes classification and clustering algorithms on a distributed computing platform, using the platform's two core technologies, MapReduce and HDFS. Experiments show that the parallel classification and clustering algorithms achieve good speedup and scalability. The main contents are as follows:

This paper introduces the open-source distributed computing platform Hadoop and its two core technologies, MapReduce and HDFS, including their operating mechanisms and implementation principles. It presents the concept of data mining and describes classification and clustering algorithms. Based on the characteristics of existing data mining technology, it analyses the development trend of data mining.

Based on Hadoop theory, this paper designs a highly reliable Hadoop platform. Because Hadoop versions prior to 1.0.0 lack security authentication, the paper introduces a Kerberos security policy. To address the single point of failure of the HDFS NameNode and the MapReduce JobTracker, it uses DRBD mirrored block-device storage technology, and it eventually builds a highly reliable and secure Hadoop environment.

The paper presents the main ideas and the implementation code of the K-Means clustering algorithm on the Hadoop platform. Several groups of experiments show that the clustering algorithm achieves better scalability and higher efficiency on the cloud computing platform.

The paper also describes the main ideas and the implementation code of the Naive Bayesian classification algorithm on the Hadoop platform. Several experiments show that the algorithm scales well on the cloud computing platform.
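
The abstract does not reproduce the thesis code, so the following is only a minimal sketch of how one K-Means iteration is commonly expressed in the MapReduce style. It is plain Python that simulates the map, shuffle and reduce phases in memory and does not use the Hadoop API. The function names (`nearest`, `map_phase`, `shuffle`, `reduce_phase`, `kmeans_iteration`) and the sample points are assumptions made for illustration, not taken from the thesis.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map: emit (cluster index, (point, 1)) for every input point."""
    for point in points:
        yield nearest(point, centroids), (point, 1)


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, centroids):
    """Reduce: average the points of each cluster to obtain its new centroid."""
    new_centroids = list(centroids)  # a cluster with no points keeps its old centroid
    for key, values in groups.items():
        count = sum(n for _, n in values)
        dims = len(values[0][0])
        sums = [sum(p[d] for p, _ in values) for d in range(dims)]
        new_centroids[key] = tuple(s / count for s in sums)
    return new_centroids


def kmeans_iteration(points, centroids):
    """Run one map, shuffle and reduce pass and return the updated centroids."""
    return reduce_phase(shuffle(map_phase(points, centroids)), centroids)


# Hypothetical sample data: two obvious groups and two initial centroids.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
centroids = kmeans_iteration(points, centroids)
print(centroids)  # [(1.0, 1.5), (9.5, 8.5)]
```

In a real Hadoop job the map and reduce steps would run on different nodes over HDFS blocks, and a driver would repeat the iteration until the centroids stop moving. Because each point is assigned independently in the map step, the work can be spread across nodes, which is consistent with the scalability the abstract reports.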

Keywords/Search Tags:

Hadoop, Data Mining, Classification and clustering algorithm, HDFS

Related items

1	Research On Algorithm Of Data Mining Based On Hadoop
2	Research And Implementation Of Web Log Storage And Analysis System Based On Hadoop
3	Research Of Data Mining Method For Public Buildings Energy Consumption Based On Hadoop
4	The Analysis And Research Of Data Mining Classification Algorithm Based On Hadoop Platform
5	Research And Design Of Parallel K-prototypes Clustering Algorithm Based On Hadoop
6	Research And Implementation Of Sales Forecast In Hadoop-based Enterprise Marketing System
7	Research Of Massive Data Processing And Mining In Database Marketing Based On Hadoop
8	Research Of Clustering Mining Algorithm Oriented Big Data
9	Research And Application Of Hadoop Distributed Clustering Mining Method Based On Virtual Machine
10	Research Of Clustering Algorithm Based On Mahout