
Research On Density-Based Clustering Algorithm For Numerical Big Data

Posted on: 2018-04-07
Degree: Master
Type: Thesis
Country: China
Candidate: J D Liu
GTID: 2428330545498580
Subject: Engineering
Abstract/Summary:
With the rapid development of the Internet, data volumes are exploding and data dimensionality is increasing. Building on traditional clustering algorithms, this thesis proposes a density-based clustering algorithm for high-dimensional data in the MapReduce environment, which improves both the speed and the quality of clustering large volumes of high-dimensional data. To cluster big data effectively, two algorithms are proposed: ENDBSCAN (Entropy-based DBSCAN) and OP-ENDBSCAN (The Optimal Number of Clusters ENDBSCAN), a clustering algorithm that dynamically determines the optimal number of clusters. ENDBSCAN takes information entropy as the main consideration when clustering, and so avoids the need of the traditional DBSCAN algorithm to define two parameters, Eps (neighborhood radius) and MinPts (density threshold). To cope with large data volume and high dimensionality, a data preprocessing method is also proposed: it divides the data into blocks and distributes them to different computer nodes for processing, using the computational capability of the nodes to improve the efficiency and scalability of the clustering algorithm. Analysis of ENDBSCAN shows that the algorithm needs the number of clusters to be determined. To solve this problem, an algorithm is proposed that dynamically determines the optimal number of clusters and evaluates clustering quality. To further improve efficiency, the data preprocessing stage and merge phase of ENDBSCAN are implemented on the MapReduce programming model. KDDCUP1999 and other data sets were used to evaluate the effectiveness of the proposed algorithms under different parameter settings. Experiments show that, compared with traditional DBSCAN, ENDBSCAN has higher accuracy and higher efficiency, and that OP-ENDBSCAN has higher accuracy and scalability than ENDBSCAN. Both ENDBSCAN and OP-ENDBSCAN also show high efficiency on data sets of different sizes.
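For context on the two parameters mentioned above, the following is a minimal sketch of the traditional DBSCAN baseline only, not of ENDBSCAN or OP-ENDBSCAN, whose details the abstract does not give. The function names `region_query` and `dbscan`, the Euclidean distance, and the sample points are assumptions chosen for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN: label each point with a cluster id (0, 1, ...) or NOISE.

    eps     -- neighborhood radius (Eps)
    min_pts -- density threshold (MinPts), counting the point itself
    """
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels


# Hypothetical sample: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

The result depends entirely on the caller's choice of `eps` and `min_pts`: with `eps=0.5` on the same points, every point is labeled as noise. This sensitivity is the manual tuning burden that, according to the abstract, the entropy-based approach is meant to avoid.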
Keywords/Search Tags:Big data, Clustering algorithm, Information entropy, Optimal number of clusters, MapReduce