Research And Optimization On K-medoids Clustering Algorithm Based On Hadoop Platform

Posted on: 2016-10-29 | Degree: Master | Type: Thesis
Country: China | Candidate: C F Zhang | Full Text: PDF
GTID: 2348330488973988 | Subject: Communication and Information System
Abstract/Summary:
With the rapid development of the Internet and the emergence of interactive applications such as blogs, WeChat and social networks, commercial data is growing explosively. Merchants need to mine useful information from huge amounts of data to support research and decision-making. Faced with such enormous volumes of data, traditional data analysis tools are no longer applicable. Cluster analysis is one of the important techniques in data mining. It is the process of dividing a set of objects into multiple clusters according to some distance measure. As a method of unsupervised learning, clustering starts without predefined categories. Traditional cluster analysis on a single machine cannot meet the demands of big data analysis in terms of either computational efficiency or complexity; cloud computing, however, provides a new approach. In this thesis, the K-Medoids clustering algorithm is optimized and implemented on the Hadoop cloud computing platform so that it can perform fast and efficient cluster analysis on big data. The main content of this thesis is as follows: (1) The traditional K-Medoids clustering algorithm is studied. To address its disadvantage that the number of clusters and the initial centers must be specified in advance, the Canopy clustering algorithm is used to optimize K-Medoids, and the Canopy-K-Medoids clustering algorithm is proposed; (2) The Canopy algorithm is analyzed, and it is found that its cluster centers and region radii T1 and T2 are selected randomly. Canopy-K-Medoids is therefore optimized with the maximum-minimum distance algorithm, and a new algorithm, named HCK-Medoids, is proposed; (3) Parallel versions of the three algorithms above are implemented with MapReduce, tested on the Hadoop platform, and compared in aspects such as clustering accuracy and speedup. The results verify that the optimized algorithms process massive data more efficiently and accurately.
The HCK-Medoids algorithm is applied to customer segmentation and compared with the K-Means algorithm; the comparison shows that HCK-Medoids segments customers more accurately.
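The idea of seeding K-Medoids with Canopy centers can be illustrated with a minimal single-machine sketch. It is not the thesis's MapReduce implementation: the function names, the Euclidean distance, the fixed random seed and the alternating assign/update loop are all assumptions made for illustration. Only the tight threshold T2 is used, since the loose threshold T1 governs canopy membership and is not needed to pick seeds; the maximum-minimum distance step of HCK-Medoids is not shown.

```python
import random


def euclidean(p, q):
    """Euclidean distance between two equal-length tuples (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def canopy_centers(points, t2, seed=0):
    """Pick canopy centers: draw a random point, then drop every remaining
    point within the tight threshold t2 of it, until no points are left."""
    rng = random.Random(seed)
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(rng.randrange(len(remaining)))
        centers.append(center)
        remaining = [p for p in remaining if euclidean(p, center) > t2]
    return centers


def assign(points, medoids):
    """Group each point with its nearest medoid."""
    clusters = [[] for _ in medoids]
    for p in points:
        nearest = min(range(len(medoids)), key=lambda i: euclidean(p, medoids[i]))
        clusters[nearest].append(p)
    return clusters


def k_medoids(points, initial_medoids, max_iter=100):
    """Alternate between assigning points and re-choosing each medoid as the
    cluster member with the smallest total distance to the other members."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        clusters = assign(points, medoids)
        updated = []
        for medoid, members in zip(medoids, clusters):
            if not members:  # keep the old medoid if its cluster is empty
                updated.append(medoid)
                continue
            updated.append(
                min(members, key=lambda c: sum(euclidean(c, o) for o in members))
            )
        if updated == medoids:  # converged: no medoid changed
            break
        medoids = updated
    return medoids, assign(points, medoids)


def canopy_k_medoids(points, t2, seed=0):
    """Canopy decides both the number of clusters and the initial medoids."""
    return k_medoids(points, canopy_centers(points, t2, seed))
```

Calling `canopy_k_medoids(points, t2)` on a list of coordinate tuples returns the final medoids and their clusters. The number of clusters is not passed in; it follows from how many canopy centers the chosen T2 yields, which is the property the abstract attributes to the Canopy step.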
Keywords/Search Tags:Hadoop, MapReduce, K-Medoids, Canopy, Clustering