Research On Clustering Methods Of Large Data Sets Based On Data Fields

Posted on:2022-01-15

Degree:Master

Type:Thesis

Country:China

Candidate:T H Wei

GTID:2518306527470204

Subject:Information and Communication Engineering

Abstract/Summary:

With the continuous development of information technology, large amounts of data are being generated, and data mining has become a key technology for integrating these data. Common analysis techniques in data mining include clustering, classification, association rules, regression analysis, and web mining. This thesis focuses on the clustering of large-scale data sets. Borrowing the idea of interaction between material particles in a physical field, we introduce the data field into the number field space. By studying the gravitation-like relations between data points and the influence factor parameter of the data field, we propose a relative mass calculation method based on the data field. Within the data field, we select points with larger mass as the initial cluster centroids, which reduces the randomness with which traditional clustering algorithms choose their initial centroids. By studying the radius of the data field's range of force, we use the value of the radius parameter to guide parameter selection in traditional clustering algorithms. For large-scale data sets, we adopt a distributed computing approach to improve the computational efficiency of the algorithm. The main work of this study is as follows.

(1) The concept of the data field originates from the physical field. Because the quality of a clustering depends on the selection of initial centroids, we use the gravitation-like relationship between data points to propose a data-field-based relative mass calculation method. On large-scale data sets, distributed computation improves the efficiency of computing the data mass.

(2) We compute the relative mass of the data points with the RM algorithm and select the N points with the largest relative mass as candidates for the initial centroids of the K-means algorithm. This scheme effectively mitigates the unstable clustering results caused by the random selection of initial points in K-means. We also discuss the selection of the K value for K-means, guided by the influence factor parameter of the data field. We designed parallelization experiments and tested them on a large-scale data set; parallelization effectively improves the computational efficiency of the algorithm.

(3) We improve the DBSCAN clustering algorithm by computing the relative mass of data objects with the RM algorithm and selecting objects with larger mass as the initial marker points of the density clustering algorithm. Because the neighborhood radius parameter of DBSCAN is difficult to choose, we optimize the value of the influence factor in the data field to provide a reference for that parameter. We also designed parallelization experiments and tested them on a large-scale data set to speed up clustering.
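
The centroid-selection idea in item (2) can be illustrated with a short Python sketch. The abstract does not give the RM algorithm's formula, so everything below is an assumption made for illustration: the potential uses a Gaussian-like kernel with unit masses, the "relative mass" is simply each point's potential normalized to sum to 1, and the `min_separation` rule (which keeps the chosen points from all falling in one dense region) is a hypothetical addition. The function names `data_field_potential`, `relative_mass`, and `pick_initial_centroids` are likewise invented here.

```python
import numpy as np


def data_field_potential(points, sigma):
    """Potential at each point, assuming a Gaussian-like data field.

    Every point is given unit mass (an assumption); sigma plays the role
    of the influence factor that controls how far a point's field reaches.
    """
    diffs = points[:, None, :] - points[None, :, :]   # pairwise differences
    sq_dist = np.sum(diffs ** 2, axis=-1)             # squared distances
    return np.exp(-sq_dist / sigma ** 2).sum(axis=1)


def relative_mass(points, sigma):
    """Hypothetical relative mass: potential normalized to sum to 1."""
    potential = data_field_potential(points, sigma)
    return potential / potential.sum()


def pick_initial_centroids(points, k, sigma, min_separation):
    """Pick up to k high-mass points that lie at least min_separation apart."""
    mass = relative_mass(points, sigma)
    order = np.argsort(-mass)                         # heaviest points first
    chosen = []
    for idx in order:
        far_enough = all(
            np.linalg.norm(points[idx] - points[c]) >= min_separation
            for c in chosen
        )
        if far_enough:
            chosen.append(idx)
        if len(chosen) == k:
            break
    return points[chosen]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
    blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
    data = np.vstack([blob_a, blob_b])
    print(pick_initial_centroids(data, k=2, sigma=1.0, min_separation=2.0))
```

The returned array could be passed as explicit initial centers to a K-means implementation in place of random initialization. The pairwise computation here needs memory quadratic in the number of points, which is one reason a distributed approach, as described in the abstract, becomes attractive for large data sets.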

Keywords/Search Tags:

The data field, clustering algorithm, parallelization, relative mass

Related items

1	The Research Of Clustering Algorithms Based On Data Mass And Potential Entropy
2	An Improved Fast Clustering Algorithm And The Related Parallelization Research
3	Research On Parallelization Of Data Stream Clustering Algorithm For Police Data
4	Clustering algorithm for mass spectrometry data using general-purpose computing on graphics processing units
5	Research And Application Of Parallelization Optimization Of Spatial Clustering Algorithm Based On Spark
6	Research On Uncertain Data Clustering Algorithm And Its Parallelization
7	Research On Dynamic Clustering And Incremental In Data Mining
8	A High Dimensional Data Stream Clustering Algorithm Of Quick Dimension Reduction
9	Research Of Clustering Algorithm Based On Relative Density
10	Research And Implementation Of Large-Scale And Efficient Clustering Algorithm Based On Spark