Research On Clustering Methods Of Large Data Sets Based On Data Fields

Posted on: 2022-01-15
Degree: Master
Type: Thesis
Country: China
Candidate: T H Wei
GTID: 2518306527470204
Subject: Information and Communication Engineering
Abstract/Summary:
With the continuous development of information technology, a large amount of data has been generated, and data mining has become a key technology for integrating these data. Common analysis techniques in data mining include clustering, classification, association rules, regression analysis, and web mining. This thesis focuses on cluster analysis of large-scale data sets.

We introduce the data field into the number field space by borrowing the idea of the interaction between matter particles in a physical field. By studying the gravitational relations between data points and the influence factor parameter of the data field, we propose a relative mass calculation method based on the data field. In the data field, we select the points with larger mass as the initial clustering centroids, which reduces the randomness of traditional clustering algorithms in selecting initial centroids. By studying the radius of the data field's range of influence, we use the value of the radius parameter to ease the parameter selection problem of traditional clustering algorithms. For large-scale data sets, we adopt a distributed computing approach to improve the computational efficiency of the algorithms.

The main work of this study is as follows.

(1) The concept of the data field originates from the physical field. Building on the gravitational relationship between data points, we propose a data-field-based relative mass (RM) calculation method to address the problem that clustering quality is affected by the selection of initial centroids. Distributed computation improves the efficiency of computing the data masses on large-scale data sets.

(2) We compute the relative mass of the data points with the RM algorithm and select the N points with the largest relative mass as candidates for the initial centroids of the K-means algorithm. This scheme effectively mitigates the unstable clustering results caused by the random selection of initial points in K-means. We also discuss the selection of the K value for K-means, guided by the influence factor parameter of the data field. We designed parallelization experiments and tested them on a large-scale data set; parallelization effectively improves the computational efficiency of the algorithm.

(3) We improve the DBSCAN clustering algorithm by computing the relative mass of the data objects with the RM algorithm and selecting the objects with larger mass as the initial marker points of the density clustering algorithm. Because the neighborhood radius parameter of DBSCAN is difficult to choose, we provide a reference value for it by optimizing the influence factor of the data field. We also designed parallelization experiments and tested them on a large-scale data set to speed up clustering.
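To make the idea of relative mass and mass-based centroid selection more concrete, here is a minimal Python sketch. The abstract does not state the RM formula, so everything below is an assumption made for illustration only: the Gaussian potential with unit masses (a common choice in the data field literature), the normalisation of potentials into relative masses, the separation radius 3σ/√2 used to keep selected centroids apart, and the hypothetical function names `relative_mass` and `pick_initial_centroids`. It is not the thesis's RM algorithm, and it runs on a single machine rather than in a distributed setting.

```python
import math


def relative_mass(points, sigma):
    """Assumed stand-in for relative mass: the Gaussian data-field
    potential at each point (all sources have unit mass), normalised
    so that the masses sum to 1. sigma is the influence factor."""
    potentials = []
    for p in points:
        phi = sum(math.exp(-(math.dist(p, q) / sigma) ** 2) for q in points)
        potentials.append(phi)
    total = sum(potentials)
    return [phi / total for phi in potentials]


def pick_initial_centroids(points, sigma, k, n_candidates):
    """Take the n_candidates heaviest points as candidates, then keep up
    to k of them that lie outside each other's assumed influence radius
    3 * sigma / sqrt(2), so the seeds do not all fall in one dense area."""
    mass = relative_mass(points, sigma)
    order = sorted(range(len(points)), key=lambda i: mass[i], reverse=True)
    radius = 3 * sigma / math.sqrt(2)
    chosen = []
    for i in order[:n_candidates]:
        if all(math.dist(points[i], points[j]) > radius for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return [points[i] for i in chosen]


# Toy data: two dense groups and one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0),
]
masses = relative_mass(points, sigma=0.5)
centroids = pick_initial_centroids(points, sigma=0.5, k=2, n_candidates=6)
print(centroids)
```

On this toy data the two selected seeds are the central points of the two dense groups, and the isolated point receives the smallest mass. The pairwise potential computation is quadratic in the number of points, which is presumably why the thesis turns to distributed computation for large data sets.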
Keywords/Search Tags: data field, clustering algorithm, parallelization, relative mass