Research On Density-Based Clustering Algorithm For Numerical Big Data

Posted on:2018-04-07

Degree:Master

Type:Thesis

Country:China

Candidate:J D Liu

Full Text:PDF

GTID:2428330545498580

Subject:Engineering

Abstract/Summary:

With the rapid development of the Internet, data volumes are exploding and data dimensionality is increasing. Building on traditional clustering algorithms, this thesis proposes a density-based clustering algorithm for high-dimensional data in the MapReduce environment, which improves both the speed and the quality of clustering large volumes of high-dimensional data.

To cluster big data effectively, two algorithms are proposed: ENDBSCAN (Entropy-based DBSCAN) and OP-ENDBSCAN (The Optimal Number of Clusters ENDBSCAN), a clustering algorithm that dynamically determines the optimal number of clusters. ENDBSCAN takes information entropy as the main criterion when clustering, and so avoids the need in traditional DBSCAN to specify the two parameters Eps (neighborhood radius) and MinPts (density threshold). To cope with large data volume and high dimensionality, a data preprocessing method is also proposed: the data are divided into blocks and distributed to different computer nodes for processing, so that the computing capability of the nodes improves the efficiency and scalability of the clustering algorithm. Analysis of ENDBSCAN shows that the algorithm requires the number of clusters to be determined. To solve this problem, an algorithm is proposed that dynamically determines the optimal number of clusters and evaluates clustering quality. To further improve efficiency, the data preprocessing stage and the merge phase of ENDBSCAN are implemented in the MapReduce programming model.

KDDCUP1999 and other data sets were used to evaluate the effectiveness of the proposed algorithms under different parameter settings. Experiments show that, compared with traditional DBSCAN, ENDBSCAN has higher accuracy and efficiency, and that OP-ENDBSCAN has higher accuracy and scalability than ENDBSCAN. Both ENDBSCAN and OP-ENDBSCAN also remain highly efficient on data sets of different sizes.
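
For reference, the traditional DBSCAN baseline mentioned in the abstract can be sketched as below. This is a minimal, unoptimized illustration of the standard algorithm, not the thesis's ENDBSCAN or OP-ENDBSCAN, whose details the abstract does not give. The function names (`region_query`, `dbscan`), the parameter names (`eps`, `min_pts`), and the sample points are assumptions chosen for illustration. It shows why Eps and MinPts matter: both must be supplied by the user, and the result depends on them.

```python
from math import dist

NOISE = -1  # label used for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i] (i included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Classic DBSCAN: returns one label per point (0, 1, ... or NOISE)."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        # points[i] is a core point: start a new cluster and expand it
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


# Illustrative data (made up): two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))  # two clusters plus one noise point
print(dbscan(points, eps=0.5, min_pts=3))  # eps too small: every point is noise
```

With `eps=1.5` the two groups are recovered and the isolated point is marked as noise; shrinking `eps` to 0.5 turns every point into noise. This sensitivity to hand-chosen parameters is the limitation the abstract says ENDBSCAN is designed to avoid.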

Keywords/Search Tags:

Big data, Clustering algorithm, Information entropy, Optimal number of clusters, MapReduce

Related items

1	Research On Determining The Number Of Clusters Based On Information Entropy
2	Research And Application On Determining Optimal Number Of Clusters In Cluster Analysis
3	Research On Determining Optimal Number Of Clusters In Cluster Analysis
4	Research On Determining Optimal Number Of Clusters In Cluster Analysis
5	Study On Parameter-free Peak Clustering Algorithm
6	Some Problems Of Determining The Optimal Number Of Clusters In Clustering Analysis
7	An Automatic Method To Determine The Number Of Clusters Based On Multi-Validity Indices
8	Algorithms Implementation Of Determining The Number Of Clusters And Initial Cluster Centers For Mixed Data
9	GMM Trees And Forests:Hierarchical Algorithms For Estimating The Number Of Clusters In High Dimensional Complex Data
10	Research On Key Technologies Of Resource Scheduling In MapReduce