Research And Implementation Of Clustering Algorithm Based On Hadoop Platform

Posted on:2018-09-18

Degree:Master

Type:Thesis

Country:China

Candidate:Z Lv

Full Text:PDF

GTID:2518306248982919

Subject:Computer system architecture

Abstract/Summary:

Clustering analysis is an important part of data mining. It is an unsupervised classification technique: without prior knowledge, data analysts can use it to uncover the intrinsic relationships and distribution patterns in a data set. With the rapid development of social informatization, the amount of data that clustering must handle is growing quickly, and traditional clustering algorithms cannot cope with data sets of this size. Hadoop, a popular distributed platform for processing large data sets, provides strong support for developing and studying clustering analysis at this scale. This thesis focuses on the inefficiency of traditional clustering algorithms on large-scale data and proposes optimized and improved clustering algorithms based on the Hadoop distributed platform. The main work is as follows:

1) The partition-based k-means algorithm is studied. Its characteristics, implementation process, and shortcomings are analyzed. To address two problems, namely that randomly selected initial centroids lead to poor clustering results and that the algorithm takes a long time on large data sets, the Hadoop-based ADC-k-means algorithm is proposed. It improves the stability and accuracy of the k-means algorithm.

2) The density-based DBSCAN algorithm is studied. Its execution characteristics and shortcomings are analyzed. By combining the Canopy clustering algorithm with the k-d tree data structure, and by reducing the query range of each object's ε-neighborhood, the Hadoop-based C-DBSCAN-K algorithm is proposed. It improves the efficiency of the algorithm without reducing its accuracy.

3) Experiments are performed on the Hadoop platform. (1) Using common UCI data sets as experimental data, it is verified that the ADC-k-means algorithm is more stable and more accurate than the k-means algorithm implemented in Mahout, a machine learning library on Hadoop. Accuracy rises by 8% on average. (2) The C-DBSCAN-K algorithm is significantly faster than the DBSCAN algorithm on four data sets generated with R packages. Running speed increases by 50.6% on average.
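
For readers unfamiliar with how k-means maps onto a platform such as Hadoop, the following is a minimal sketch of one iteration of plain k-means written as a map step and a reduce step. It is not the ADC-k-means algorithm, whose initial-centroid selection the abstract does not describe, and it runs in a single process rather than on a cluster. The function names map_point, reduce_cluster, and kmeans_iteration are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import dist


def map_point(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_cluster(points):
    """Reduce step: average all points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One k-means pass: map every point, group by key, reduce each group."""
    groups = defaultdict(list)
    for p in points:
        key, value = map_point(p, centroids)
        groups[key].append(value)
    # Assumption: a centroid that attracted no points keeps its old position.
    return [
        reduce_cluster(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans_iteration(data, [(0.0, 0.0), (10.0, 10.0)]))
```

In a real MapReduce job the grouping by key is done by the framework's shuffle phase, and the iteration is repeated as a sequence of jobs until the centroids stop moving. Because every iteration starts from the previous centroids, a poor random initial choice affects all later passes, which is the weakness the abstract says ADC-k-means is designed to address.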

Keywords/Search Tags:

Clustering analysis, Hadoop, ADC-k-means, C-DBSCAN-K

Related items

1	Research Of Clustering Algorithm Based On Cloud Computing Platform
2	Research On Machine Learning Clustering Algorithms In The Hadoop Development Environment
3	The Research And Application Of Security Log Clustering Mining Algorithm Based On Hadoop Platform
4	Clustering Analysis Based On Hadoop
5	Research On K-Means Clustering Algorithm Based On Hadoop Cloud Computing Platform
6	Application And Research Of DBSCAN Based On Hadoop Platform
7	Research On Text Clustering Algorithm Based On DBSCAN
8	Research On The Application Of User Behavior Analysis Based On Hadoop
9	Research And Application Of Clustering In Telecom Customer Differentiated Reminder Based On Hadoop
10	Chroma Clustering Analysis Of Film Poster Based On Hadoop