Research On Adaptive Parameter Of DBSCAN Algorithm And Its Application On Spark Platform

Posted on:2018-10-20

Degree:Master

Type:Thesis

Country:China

Candidate:L Tang

Full Text:PDF

GTID:2348330518998526

Subject:Software engineering

Abstract/Summary:

In recent years, the rise of cloud computing has driven the development of big data platforms. Data processing is at the core of data mining and knowledge discovery. Spark, a big data processing platform, introduces the RDD model to speed up data processing and meet the needs of both industry and researchers. Clustering is an important part of data mining, and DBSCAN is a typical density-based clustering algorithm: it can identify clusters of arbitrary shape in noisy data sets and gives good clustering results. However, the algorithm is sensitive to its input parameters and cannot adapt them automatically. For data sets of non-uniform density in particular, global parameters seriously degrade the clustering quality. Its computational complexity is also high, so it processes massive data inefficiently. This thesis addresses these problems. The main work is as follows:

(1) A partition-based improved DBSCAN algorithm, PDBSCAN, is proposed to address DBSCAN's sensitivity to parameters and its inability to handle non-uniform data sets effectively. The KNN matrix of the data set is computed, and the density transition threshold is derived from the information in that matrix. The data are then divided into subsets of different density. For each subset, the threshold MinPts and the neighborhood radius Eps are determined automatically from the value of a clustering quality index. Finally, each subset is clustered locally with its own MinPts and Eps, and the partial results are merged. Experiments show that PDBSCAN outperforms the DBSCAN, VDBSCAN, and AGD-DBSCAN algorithms in parameter adaptation and in handling non-uniform data sets.

(2) To reduce the running time and I/O consumption of the algorithm, the improved algorithm is parallelized on Spark. Experiments show that the parallel version effectively reduces the running time and demonstrates the advantages of Spark for processing large data. Finally, the algorithm is applied to mining the behavior data of Internet users, providing effective decision support for interest-based personalized recommendation.
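
The partition-then-cluster idea in item (1) can be illustrated with a minimal single-machine sketch. It is not the thesis's PDBSCAN: the abstract does not give the threshold formula or the clustering index, so the sketch makes two loudly labeled assumptions. The density transition is taken to be the largest jump in the sorted k-th-nearest-neighbor distances, and each subset's Eps is taken to be the largest k-distance inside it, with MinPts = k + 1. All function names are hypothetical.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbour (brute force)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result

def split_by_density(points, k):
    """Split point indices into a denser and a sparser group.

    ASSUMPTION: the cut is placed at the largest jump in the sorted
    k-distance curve; the thesis's actual threshold rule is not given.
    """
    kd = k_distances(points, k)
    order = sorted(range(len(points)), key=lambda i: kd[i])
    gaps = [kd[order[i + 1]] - kd[order[i]] for i in range(len(order) - 1)]
    cut = gaps.index(max(gaps)) + 1
    groups = [order[:cut], order[cut:]]
    return [g for g in groups if g], kd

def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, -1 meaning noise."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:  # j is a core point, keep expanding
                queue.extend(nb)
    return labels

def partitioned_dbscan(points, k=3):
    """Cluster each density group with its own Eps, then merge the labels."""
    groups, kd = split_by_density(points, k)
    labels = [-1] * len(points)
    next_id = 0
    for group in groups:
        subset = [points[i] for i in group]
        eps = max(kd[i] for i in group)  # ASSUMPTION: local Eps heuristic
        local = dbscan(subset, eps, k + 1)
        for i, lab in zip(group, local):
            if lab != -1:
                labels[i] = next_id + lab  # shift ids so groups do not collide
        next_id += max(local) + 1
    return labels

# A tight 3x3 grid and a loose 3x3 grid placed far apart.
dense = [(x, y) for x in range(3) for y in range(3)]
sparse = [(100 + 10 * x, 100 + 10 * y) for x in range(3) for y in range(3)]
labels = partitioned_dbscan(dense + sparse, k=3)
global_labels = dbscan(dense + sparse, 1.5, 4)
```

On this toy input, `labels` assigns the tight grid to one cluster and the loose grid to another, while `global_labels`, computed with a single Eps tuned to the tight grid, marks every point of the loose grid as noise. Taking the maximum k-distance as Eps is fragile when a subset contains outliers; it is used here only to keep the sketch short.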

Keywords/Search Tags:

DBSCAN, clustering, Spark, user behavior

Related items

1	KDSG-DBSCAN:A High Performance DBSCAN Algorithm Based On K-D Tree And Spark GraphX
2	Implementation And Application Of Clustering Algorithm Based On Spark
3	A Research About DBSCAN Text Clustering Based On Spark Platform
4	The Optimization Of Clustering And Classification Algorithms Based On SPARK
5	Research And Implementation Of Clustering Algorithm For Massive Mobile Internet User Behavior
6	Research On Parallization Of DBSCAN Clustering Algorithm For Spatial Data Mining Based On Spark Platform
7	The Design And Implementation Of User Behavior Analysis System Based On Spark
8	Research And Implementation Of User Behavior Analysis System Based On Spark
9	Design And Implementation Of A User Behavior System For Query Logs Based On Spark
10	User Behavior Analysis Algorithm And Its Application On Spark