Research On Algorithm Of Data Mining Based On Hadoop

Posted on: 2016-03-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Xie
Full Text: PDF
GTID: 2308330470471097
Subject: Computer technology
Abstract/Summary:
The era of big data is approaching. Traditional data storage methods cannot cope with the massive task of analysing, managing, and mining data, and how to discover useful information and knowledge quickly and effectively has become a new topic for data mining technology. Faced with huge amounts of data, traditional data mining algorithms are inefficient and waste storage space. The emergence of cloud computing offers data mining algorithms new ways to improve parallelism: its efficient programming model, massive storage capacity, and powerful computing capabilities provide a broad platform for the development of data mining.

Hadoop is an open-source Apache project for building cloud computing platforms. Distributed computing platforms based on it have proven very stable, and it helps build a cloud computing platform quickly and easily. Hadoop has been widely studied and applied because it is open source, high-performing, flexible, and easy to use. It uses the MapReduce programming model for distributed computing and the HDFS distributed file system for file storage, and it includes a series of subprojects such as databases and data warehouses.

This thesis studies the K-means clustering algorithm. The algorithm is modified by sampling the data on a large scale and using the convex hull and its antipodal points to determine the initial two cluster centers, and the modified process is parallelized with the MapReduce programming model. Finally, with the Reuters-21578 news collection as the data source, comparative experiments were run with different distance measures, serial versus parallel execution, and different numbers of cluster nodes to verify the efficiency of the improved algorithm. The results show that, compared with the serial algorithm, the improved parallel algorithm is clearly more reliable and efficient as the number of cluster nodes and the data size increase.
Keywords/Search Tags: Hadoop, Data Mining, K-means, MapReduce, HDFS
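
The abstract describes parallelizing K-means through the MapReduce programming model but gives no implementation details. As a rough illustration only, one K-means iteration can be phrased as a map step (assign each point to its nearest center) followed by a reduce step (average the points of each cluster). The sketch below simulates this in plain Python on a single machine; the function names, the Euclidean distance, the sample points, and the initial centers are all assumptions made for illustration and are not taken from the thesis. The convex-hull seeding and the Hadoop job setup are not shown.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_map(point, centers):
    """Map step (hypothetical): emit (index of nearest center, (point, 1))."""
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce step (hypothetical): average all points assigned to one center."""
    count = sum(n for _, n in values)
    dims = len(values[0][0])
    return tuple(sum(p[d] for p, _ in values) / count for d in range(dims))


def kmeans_iteration(points, centers):
    """Simulate one map -> shuffle -> reduce round; return updated centers."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for point in points:
        key, value = kmeans_map(point, centers)
        groups[key].append(value)
    # A center that received no points keeps its previous position.
    return [
        kmeans_reduce(groups[i]) if i in groups else tuple(centers[i])
        for i in range(len(centers))
    ]


# Illustrative data: two obvious groups and two assumed initial centers.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = [(0.0, 0.0), (10.0, 11.0)]
new_centers = kmeans_iteration(points, centers)
print(new_centers)
```

In a real Hadoop job the map calls would run in parallel over HDFS input splits and the iteration would repeat until the centers stop moving; swapping `dist` for another function is one way to compare distance measures.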