
Research And Implementation Of Clustering Algorithm Based On Hadoop Platform

Posted on: 2018-09-18
Degree: Master
Type: Thesis
Country: China
Candidate: Z Lv
Full Text: PDF
GTID: 2518306248982919
Subject: Computer system architecture
Abstract/Summary:
Clustering analysis is an important part of data mining. It is an unsupervised classification technique: without prior knowledge, data analysts can use it to discover the intrinsic relationships and distribution patterns in a data set. With the rapid development of social informatization, the amount of data that clustering must process is growing rapidly, and traditional clustering algorithms cannot cope with data sets of the current scale. Hadoop, a popular distributed platform for processing large data sets, provides strong support for the research and development of clustering analysis at this scale. This thesis focuses on the inefficiency of traditional clustering algorithms on large-scale data and proposes optimized and improved clustering algorithms based on the Hadoop distributed platform. The major contributions are as follows:
1) The partition-based k-means algorithm is studied. Its characteristics, implementation process, and shortcomings are analyzed. To address two problems, namely that randomly selected initial centroids lead to unsatisfactory clustering results and that the algorithm takes a long time on large data sets, the Hadoop-based ADC-k-means algorithm is proposed. It improves the stability and accuracy of the k-means algorithm.
2) The density-based DBSCAN algorithm is studied. Its execution characteristics and shortcomings are analyzed. By combining the Canopy clustering algorithm with the k-d tree data structure, and by reducing the query range for the ε-neighborhood of each object, the Hadoop-based C-DBSCAN-K algorithm is proposed. It improves the efficiency of the algorithm without reducing its accuracy.
3) Experiments are performed on the Hadoop platform. (1) Using common UCI data sets as experimental data, it is verified that the ADC-k-means algorithm is more stable and more accurate than the k-means algorithm implemented in Mahout, the machine learning library on Hadoop; the accuracy rate rises by 8% on average. (2) On four data sets generated with R packages, the C-DBSCAN-K algorithm is significantly faster than the DBSCAN algorithm; the running speed increases by 50.6% on average.
Keywords/Search Tags: Clustering analysis, Hadoop, ADC-k-means, C-DBSCAN-K
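
The abstract mentions using a k-d tree to narrow the search for the ε-neighborhood of an object in DBSCAN. The thesis's own Hadoop implementation is not shown here, so the following is only a minimal single-machine sketch of that one ingredient, under the assumption of Euclidean distance. The function names (`build_kdtree`, `eps_neighbors`, `brute_force_neighbors`) are hypothetical and chosen for illustration; the Canopy step and the MapReduce distribution are not modeled.

```python
import math


def build_kdtree(points, depth=0):
    """Build a k-d tree; each node is a tuple (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # median point becomes the node
    return (
        points[mid],
        axis,
        build_kdtree(points[:mid], depth + 1),
        build_kdtree(points[mid + 1:], depth + 1),
    )


def eps_neighbors(node, query, eps, found=None):
    """Collect all points within distance eps of query (Euclidean)."""
    if found is None:
        found = []
    if node is None:
        return found
    point, axis, left, right = node
    if math.dist(point, query) <= eps:
        found.append(point)
    diff = query[axis] - point[axis]
    # Always search the side of the splitting plane that contains the query.
    near, far = (left, right) if diff < 0 else (right, left)
    eps_neighbors(near, query, eps, found)
    # The other side can only hold neighbors if the plane is within eps.
    if abs(diff) <= eps:
        eps_neighbors(far, query, eps, found)
    return found


def brute_force_neighbors(points, query, eps):
    """Reference implementation: compare the query against every point."""
    return [p for p in points if math.dist(p, query) <= eps]
```

The tree query returns the same neighbors as the brute-force scan but skips any subtree whose splitting plane lies farther than ε from the query point, which is the sense in which the query range is reduced. How large the saving is depends on the data and on ε; the 50.6% figure in the abstract refers to the thesis's full C-DBSCAN-K algorithm, not to this sketch.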