Research On Improved DBSCAN Algorithm Based On Spark Platform

Posted on: 2021-01-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
GTID: 2428330629486186
Subject: Computer application technology
Abstract/Summary:
With the rapid development and wide application of information technology, Internet services increasingly shape how people work and live. They also generate vast amounts of data, which makes it increasingly important to use data mining tools to extract valuable information from raw, heterogeneous data to guide social production and life. As a widely used clustering algorithm and a conventional data mining method, DBSCAN can find clusters of arbitrary shape, and its clustering result is not affected by noise points. However, the algorithm has several defects. First, when the data set is large, it requires a considerable amount of memory. Second, the clustering result is sensitive to the input parameters, which complicates parameter setting. Finally, on unevenly distributed data, it is difficult for the algorithm to achieve a good clustering result. To address these defects, the thesis proposes an improved DBSCAN algorithm combined with the Lightning Attachment Procedure Optimization (LAPO) algorithm and parallelizes it on Spark, a new-generation framework for large-scale data processing. The main research contents of the thesis are summarized as follows:

(1) A method for obtaining cluster centers based on the LAPO algorithm is proposed. The K-means algorithm is highly dependent on the data and sensitive to the selection of initial cluster centers. The iterative search of an intelligent optimization algorithm replaces the step-by-step center search of the traditional K-means algorithm in order to obtain more compact clusters. The LAPO algorithm has strong search capabilities and iteratively finds the category contours of the data, thereby overcoming both the initialization sensitivity of K-means and its low cluster compactness.

(2) An improved DBSCAN algorithm combined with the Lightning Attachment Procedure Optimization algorithm (LAPO-DBSCAN) is designed. The LAPO-based method for obtaining cluster centers is used in the data partitioning stage of the improved algorithm, which consists of three main steps: data division, local clustering, and merging of clustering results. Compared with the traditional DBSCAN algorithm, the improved algorithm requires less memory, is easier to use, and produces better clustering results. Comparative experiments confirm the good clustering performance of the LAPO-DBSCAN algorithm.

(3) The parallel operation of the LAPO-DBSCAN algorithm on the Spark platform is implemented. With the help of the highly efficient and reliable computing power provided by the distributed computing framework, the thesis studies the parallelization strategy of the LAPO-DBSCAN algorithm on the Spark platform, and realizes the paralleliz...
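
The three stages named above (data division, local clustering, and merging of clustering results) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation. The LAPO-based center search is not reproduced, so `centers` is simply supplied by the caller. The 2*eps overlap rule and the union-find merge on shared core points are one common way to keep partitioned DBSCAN consistent and are assumed here. All function names are hypothetical, and an ordinary loop stands in for Spark.

```python
import math
from collections import defaultdict

NOISE = -1

def region_query(points, idx, i, eps):
    """Indices from idx that lie within eps of point i (i itself included)."""
    return [j for j in idx if math.dist(points[i], points[j]) <= eps]

def local_dbscan(points, idx, eps, min_pts):
    """Plain DBSCAN restricted to the indices in idx.
    Returns ({point index: local label}, set of locally core indices)."""
    labels, core, cluster = {}, set(), 0
    for i in idx:
        if i in labels:
            continue
        neighbors = region_query(points, idx, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        core.add(i)
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels.get(j, NOISE) != NOISE:
                continue  # already assigned to a cluster
            labels[j] = cluster
            nbrs = region_query(points, idx, j, eps)
            if len(nbrs) >= min_pts:
                core.add(j)
                queue.extend(nbrs)
        cluster += 1
    return labels, core

def partition(points, centers, eps):
    """Stage 1 (assumed rule): the nearest center is a point's home partition;
    the point is also copied to every partition whose center is at most
    2*eps farther away, so each home point sees all of its eps-neighbors."""
    home, parts = [], defaultdict(list)
    for i, p in enumerate(points):
        d = [math.dist(p, c) for c in centers]
        h = min(range(len(centers)), key=d.__getitem__)
        home.append(h)
        for k, dk in enumerate(d):
            if dk <= d[h] + 2 * eps:
                parts[k].append(i)
    return home, parts

def partitioned_dbscan(points, centers, eps, min_pts):
    home, parts = partition(points, centers, eps)
    # Stage 2: independent local clustering (the step a cluster would distribute).
    local = {k: local_dbscan(points, idx, eps, min_pts) for k, idx in parts.items()}

    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Stage 3: merge local clusters that share a point which is core at home.
    for i in range(len(points)):
        labels_h, core_h = local[home[i]]
        if i not in core_h:
            continue
        a = (home[i], labels_h[i])
        for k, (labels_k, _) in local.items():
            if k != home[i] and labels_k.get(i, NOISE) != NOISE:
                parent[find((k, labels_k[i]))] = find(a)

    # Global label: first non-noise local label found in any partition.
    ids, result = {}, []
    for i in range(len(points)):
        keys = [(k, lab[i]) for k, (lab, _) in local.items()
                if lab.get(i, NOISE) != NOISE]
        result.append(ids.setdefault(find(keys[0]), len(ids)) if keys else NOISE)
    return result
```

Each call to `local_dbscan` touches only the points of one partition, so peak memory depends on partition size rather than on the whole data set, and the calls do not depend on one another. That independence is what makes the local clustering stage a natural candidate for parallel execution on a platform such as Spark.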
Keywords/Search Tags: Spark platform, DBSCAN algorithm, Lightning Attachment Procedure Optimization algorithm, parallel computing