The Research And Implement Of Data Mining Algorithms Based On Hadoop

Posted on:2016-03-14

Degree:Master

Type:Thesis

Country:China

Candidate:J W He

GTID:2298330467992838

Subject:Communication and Information System

Abstract/Summary:

With breakthroughs in smartphone operating system technology, the spread of smartphones, and the arrival of the mobile Internet era, web applications now produce terabytes or even petabytes of web logs every day. How to extract users' personal preferences from these massive logs, in order to provide personalized recommendation services and bring convenience to people's lives, has become a hot topic for major Internet companies and research institutions. The emergence of the open source cloud computing platform Hadoop makes it possible to mine web log data at this scale. The main contents of this paper are as follows.
First, this paper studies the Hadoop cloud computing platform. Hadoop is a top-level open source project of the Apache Software Foundation; it can use thousands of inexpensive computers to provide parallel computing and storage services. The paper examines in depth the Hadoop Distributed File System (HDFS), the MapReduce parallel programming model, and the distributed column-oriented database HBase.
Second, this paper studies cluster analysis, one of the most widely used techniques in data mining. It covers the origins and definition of cluster analysis and the similarity and distance measures between samples, and it introduces the common clustering methods in detail.
Third, this paper designs and implements a data mining system on the Hadoop platform. The system encapsulates the underlying Hadoop interfaces and provides the clustering algorithms discussed in this paper. From top to bottom, it consists of a user layer, a service engine layer, a mining engine layer, and a Hadoop driver layer.
Fourth, this paper studies the K-Means and PAM clustering algorithms and improves K-Means on the basis of PAM. The improved algorithm overcomes shortcomings of the original K-Means algorithm and is parallelized on the Hadoop platform; it is further optimized at three levels.
Finally, this paper reports extensive experiments on the Hadoop-based clustering algorithm. To verify its correctness, the algorithm is evaluated in terms of effectiveness, optimization rate, and speedup.
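The abstract does not give the thesis's algorithm in detail, so the following is only a minimal sketch of how one standard K-Means iteration can be phrased as a map step and a reduce step, the shape that a MapReduce job on Hadoop would take. It runs in plain Python on a single machine, does not include the PAM-based improvement, and all function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`) are hypothetical.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_map(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def kmeans_reduce(points):
    """Reduce step: average all points that share one centroid index."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One K-Means pass: map, group by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centroids)
        groups[key].append(value)
    # Assumption of this sketch: a centroid with no assigned points
    # keeps its previous position.
    return [kmeans_reduce(groups[i]) if groups[i] else centroids[i]
            for i in range(len(centroids))]
```

In a real Hadoop job the grouping by centroid index would be done by the framework's shuffle phase, and the driver would repeat the iteration until the centroids stop moving. For example, `kmeans_iteration([(0, 0), (0, 2), (10, 10), (10, 12)], [(0, 0), (10, 10)])` moves the two centroids to the means of their assigned points.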

Keywords/Search Tags:

data-mining, cluster-algorithm, k-means, hadoop, parallel-computing, bigdata

Related items

1	Parallel Data Mining Theory Research And Application
2	Research On Parallelization Of Data Mining Algorithm Based On Distributed Platforms Spark And YARN
3	Research On Distributed Fast Clustering Algorithm Based On Mapreduce
4	Data Classification And Prediction Of The Model Based On Rbf Neural Network Parallel Learning
5	Parallel Processing Technology Research And Application Based On The Cluster Of Massive Remote Sensing Data
6	Research And Implementation Of Parallel FP-Growth Algorithm Based On Cluster Of PC
7	E-commerce Applications Based On Multi-core Cluster Parallelization
8	Key Techniques Study Of Parallel Splatting Algorithm On Cluster
9	Two Classes Of Biological Computing And Applications In Data Mining
10	Methods And Applications Study Of Cluster-based Spatial Data Mining