Research And Implementation Of Data Mining Algorithms Based On Cloud Platform

Posted on: 2014-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: W Yan
Full Text: PDF
GTID: 2268330401964478
Subject: Computer application technology
Abstract/Summary:
With the development of the information society, the amount of data produced daily is growing exponentially. Companies face the problem of how to extract useful information from massive data. Data mining algorithms process data and uncover hidden, useful information that supports company decision-making. However, traditional algorithms take a long time to process massive data, or cannot process it at all. An effective solution is to port the traditional algorithms to a cloud platform and parallelize them.

Apache Hadoop is a distributed system framework. HDFS provides file storage with high fault tolerance and high read/write throughput. MapReduce provides a parallel programming framework: the user writes Map and Reduce classes for a distributed program without needing to know the details of distributed parallel programming. By offering both a mass data storage platform and a simple parallel computing platform, Hadoop provides the basis for applying traditional data mining algorithms to massive data.

In this dissertation, we study the Hadoop platform technology and common data mining algorithms, and use the parallel data processing ability of a Hadoop cluster to parallelize the K-Means algorithm and the collaborative filtering algorithm. The main work is as follows:

(1) K-Means is a common clustering algorithm. It divides the original data into a number of clusters according to the similarity between elements. To address the dependence of K-Means on the value of K and on the initial centers, this dissertation proposes an improved clustering algorithm based on sampling and density. The initial K value and initial centers are determined by sampling and density, and the algorithm is parallelized on the Hadoop platform. Experiments show that the improved K-Means algorithm has good parallelism.

(2) Collaborative filtering is the most widely used item recommendation algorithm. It finds the k neighbors with the highest similarity by calculating user similarity, and recommends items to a user according to the neighbors' ratings of those items. In this dissertation, we propose a hybrid recommendation algorithm based on user similarity and attribute weights to address the sparsity of user ratings. We learn the weights of the attributes each user prefers from the user's rating records and combine them with user similarity to recommend items. Finally, we port the algorithm to the Hadoop platform. Experiments show that the improved collaborative filtering algorithm outperforms the original algorithm in precision and parallel performance.

(3) At present, the Hadoop platform is used via the command line, which is difficult for ordinary users. In this dissertation, we encapsulate the low-level details of the Hadoop platform and design and implement a data mining system based on it. The system packages the data mining algorithms and the Hadoop platform details and provides a REST interface. Users call the parallelized data mining algorithms for data analysis through the REST interface without having to understand the underlying implementation.
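
To make the parallelization in item (1) concrete, the following is a minimal sketch of how one K-Means iteration can be split into a Map step and a Reduce step. It is an illustration under stated assumptions, not the dissertation's implementation: it runs in a single Python process rather than on a Hadoop cluster, the function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`, `kmeans`) are hypothetical, and the initial centers are passed in by the caller instead of being chosen by the sampling- and density-based method the abstract describes.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_map(point, centers):
    """Map step: emit (index of the nearest center, point)."""
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest, point


def kmeans_reduce(points):
    """Reduce step: average all points that share one center index."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(data, centers):
    """One MapReduce-style pass: map, group by key (the shuffle), reduce."""
    groups = defaultdict(list)
    for point in data:
        key, value = kmeans_map(point, centers)
        groups[key].append(value)
    # Assumption: a center that attracted no points keeps its old position.
    return [
        kmeans_reduce(groups[i]) if i in groups else tuple(centers[i])
        for i in range(len(centers))
    ]


def kmeans(data, centers, max_iter=100, tol=1e-9):
    """Repeat the pass until no center moves more than tol, or max_iter is hit."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        new_centers = kmeans_iteration(data, centers)
        converged = all(dist(a, b) <= tol for a, b in zip(centers, new_centers))
        centers = new_centers
        if converged:
            break
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(data, [(0, 0), (10, 10)]))
```

Because each call to `kmeans_map` depends only on one point and the current centers, the map calls can run independently on different nodes, and each `kmeans_reduce` call needs only the points for one center. That independence is what makes the algorithm a natural fit for a MapReduce framework.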
Keywords/Search Tags: Hadoop, MapReduce, Data Mining, K-Means, Collaborative Filtering