
The Research And Implement Of Data Mining Algorithms Based On Hadoop

Posted on: 2016-03-14
Degree: Master
Type: Thesis
Country: China
Candidate: J W He
Full Text: PDF
GTID: 2298330467992838
Subject: Communication and Information System
Abstract/Summary:
With breakthroughs in smartphone operating systems, the spread of smartphones, and the arrival of the mobile Internet era, web applications now produce terabytes or even petabytes of web logs every day. How to extract users' personal preferences from these massive logs, in order to provide personalized recommendation services and make people's lives more convenient, has become a hot topic for major Internet companies and research institutions. The emergence of the open source cloud computing platform Hadoop makes it possible to mine web log data at this scale. The main contents of this paper are as follows.

First, this paper studies the Hadoop cloud computing platform. Hadoop is a top-level open source project of the Apache Software Foundation that can use thousands of inexpensive computers to provide parallel computing and storage services. The paper examines in depth the Hadoop Distributed File System (HDFS), the parallel programming model MapReduce, and the distributed column-oriented database HBase.

Second, this paper studies cluster analysis, a widely used technique in data mining. It covers the origins and definition of cluster analysis and the similarity and distance measures between samples, and introduces the common clustering methods in detail.

Third, this paper designs and implements a data mining system based on the Hadoop platform. The system encapsulates the underlying Hadoop interfaces and provides the clustering algorithms discussed in this paper. From top to bottom, it consists of a user layer, a service engine layer, a mining engine layer, and a Hadoop driver layer.

Fourth, this paper studies the K-Means and PAM clustering algorithms and improves the K-Means algorithm based on PAM. The improved algorithm overcomes shortcomings of the original K-Means algorithm and is parallelized on the Hadoop platform. It is then further optimized at three levels.

Finally, this paper reports extensive experiments on the Hadoop-based clustering algorithms. To verify the correctness of the algorithm, the experiments evaluate its effectiveness, optimization rate, and speedup.
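To illustrate how K-Means maps onto the MapReduce model mentioned above, the following is a minimal sketch of one iteration, run locally in plain Python rather than on Hadoop. It shows standard K-Means only; the abstract does not specify the PAM-based improvement or the three optimization levels, so neither is represented here. The function names `map_assign`, `reduce_mean`, and `kmeans_iteration` are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_assign(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_mean(points):
    """Reduce step: average all points that share one centroid index."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One K-Means iteration as map -> shuffle -> reduce (local simulation)."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for p in points:
        key, value = map_assign(p, centroids)
        groups[key].append(value)
    # Assumption: a centroid with no assigned points keeps its old position.
    return [
        reduce_mean(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]
```

For example, `kmeans_iteration([(0, 0), (0, 2), (10, 10), (10, 12)], [(0, 0), (10, 10)])` returns `[(0.0, 1.0), (10.0, 11.0)]`. In a real Hadoop job the map and reduce steps would run on separate nodes, and the driver would repeat the iteration until the centroids stop moving.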
Keywords/Search Tags: data-mining, cluster-algorithm, k-means, hadoop, parallel-computing, bigdata