
Research On Adaptive Parameter Of DBSCAN Algorithm And Its Application On Spark Platform

Posted on: 2018-10-20
Degree: Master
Type: Thesis
Country: China
Candidate: L Tang
GTID: 2348330518998526
Subject: Software engineering
Abstract/Summary:
In recent years, the rise of cloud computing has driven the development of big data platforms. Data processing is the core of data mining and knowledge discovery. As a big data processing platform, Spark introduces the RDD model to speed up data processing and to meet the data processing needs of both industry and academia. Clustering is an important part of data mining, and DBSCAN is a typical density-based clustering algorithm: it can identify clusters of arbitrary shape in noisy data sets and gives good clustering results. However, the algorithm is sensitive to its input parameters and cannot adapt them automatically. For non-uniform data sets in particular, global parameters seriously degrade the clustering result. Its computational complexity is also high, so it processes massive data inefficiently. This thesis proposes solutions to these problems. The main work is as follows:

(1) A partition-based improved DBSCAN algorithm is proposed to address DBSCAN's sensitivity to parameters and its inability to handle non-uniform data sets effectively. The KNN matrix of the data set is computed, and the density transition threshold is derived from the information in the KNN matrix. The data is then divided into subsets of different density. The threshold MinPts and the neighborhood radius Eps are determined automatically from the value of a clustering quality index. Finally, local clustering with a dynamic neighborhood is performed on each subset using MinPts and the corresponding Eps, and the results are merged. Experiments show that the improved algorithm, PDBSCAN, outperforms the DBSCAN, VDBSCAN, and AGD-DBSCAN algorithms in parameter adaptation and in handling non-uniform data sets.

(2) To reduce the running time and I/O consumption of the algorithm, the improved algorithm is parallelized on Spark. Experiments show that the parallel version effectively reduces running time and demonstrates the advantages of Spark in processing large data. Finally, the algorithm is applied to mining the behavior data of Internet users, providing effective decision support for interest-based personalized recommendation.
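The partition-then-cluster idea in item (1) can be illustrated with a minimal single-machine sketch. It is not the thesis's PDBSCAN implementation: the abstract does not give the density transition threshold or the clustering quality index, so the sketch assumes a simple cut wherever sorted k-th-neighbor distances jump by more than a hypothetical `jump_ratio`, takes each subset's Eps as its largest k-distance, and fixes MinPts at `k + 1`. All function and parameter names are illustrative.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor (brute force)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result

def split_by_density(points, k, jump_ratio):
    """Group point indices by density level.

    Assumption: a new group starts wherever consecutive sorted k-distances
    grow by more than jump_ratio (a stand-in for the thesis's threshold).
    """
    kd = k_distances(points, k)
    order = sorted(range(len(points)), key=lambda i: kd[i])
    groups, current = [], [order[0]]
    for prev, nxt in zip(order, order[1:]):
        if kd[nxt] > jump_ratio * kd[prev]:
            groups.append(current)
            current = []
        current.append(nxt)
    groups.append(current)
    return groups, kd

def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, with -1 meaning noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = -1          # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbj = neighbors(j)
            if len(nbj) >= min_pts:  # j is a core point, keep expanding
                queue.extend(nbj)
    return labels

def partitioned_dbscan(points, k=2, jump_ratio=3.0):
    """Run DBSCAN per density group with a local Eps, then merge the labels."""
    groups, kd = split_by_density(points, k, jump_ratio)
    labels = [-1] * len(points)
    next_id = 0
    for group in groups:
        eps = max(kd[i] for i in group)  # assumption: local Eps = max k-distance
        local = dbscan([points[i] for i in group], eps, k + 1)
        for i, lab in zip(group, local):
            if lab >= 0:
                labels[i] = next_id + lab
        next_id += max(local) + 1        # keep cluster ids unique across groups
    return labels

dense = [(0, 0), (0, 1), (1, 0), (1, 1)]
sparse = [(100, 100), (100, 110), (110, 100), (110, 110)]
data = dense + sparse
global_labels = dbscan(data, eps=1.0, min_pts=3)
local_labels = partitioned_dbscan(data, k=2, jump_ratio=3.0)
```

With a single global Eps tuned to the dense square, plain DBSCAN marks the sparse square as noise, while the per-subset Eps recovers both groups. On Spark, one natural mapping would be to run the per-subset clustering as independent tasks, though the abstract does not say how the thesis distributes the work.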
Keywords/Search Tags: DBSCAN, clustering, Spark, user behavior