Research Of Clustering Algorithm Based On Data Local Distribution

Posted on: 2022-02-25
Degree: Master
Type: Thesis
Country: China
Candidate: M Zhang
Full Text: PDF
GTID: 2518306341986999
Subject: Software engineering
Abstract/Summary:
As industry information becomes increasingly integrated, the amount of data has grown exponentially, and the types of data are numerous and complex. Traditional methods cannot mine valuable information from datasets with complex distributions quickly and effectively. Cluster analysis can find useful information based on the similarities between data points without prior knowledge. It has been widely used in many fields, such as image processing, information retrieval, and atmospheric pollution analysis. However, when processing different types of datasets, many clustering algorithms cannot correctly identify clusters with complex distributions. To identify clusters with various densities and arbitrary shapes effectively, this work combines the mutual nearest neighbor technique with the local distribution of the data to construct an adaptive local density, and proposes two clustering algorithms, DBLCM and CDBTS.

(1) The DBLCM algorithm proposes a new local center measure that indicates, from the local distribution of the data points, whether a data point belongs to the core region. In addition, a bi-directional absolute distance is defined based on mutual k-nearest-neighbor information. Using the bi-directional absolute distance and the local center measure, the algorithm decides whether a data point belongs to the core region or the boundary region. Points in the core region are then merged with the high-density points among their neighbors to form initial clusters. Next, each point in the boundary region is assigned to the initial cluster that contains its nearest neighbor. To verify its effectiveness, DBLCM is compared with six representative benchmark algorithms on two-dimensional and multi-dimensional datasets containing clusters of arbitrary shapes and densities, and the feasible range of the parameter k is also analyzed. The results show that DBLCM can effectively detect clusters with arbitrary shapes and densities.

(2) The CDBTS algorithm first defines the local density of a data point as a function of the number of its inverse k nearest neighbors. It then partitions the dataset into three disjoint regions based on a tripartite strategy: a positive region, a negative region, and a boundary region. CDBTS processes the points in the three regions differently. Points in the positive region are combined with the high-density points among their mutual neighbors to form the backbone. A point in the negative region is used to expand the boundary region if one of its k nearest neighbors is in the positive region. Each point in the boundary region is assigned to the backbone that contains its nearest neighbor. As with DBLCM, CDBTS is compared with seven benchmark algorithms on two-dimensional and multi-dimensional datasets containing clusters of arbitrary shapes and densities. In addition, sensitivity experiments are conducted to verify the ability of the algorithm to identify small clusters in imbalanced datasets. The results show that CDBTS outperforms the benchmarks in identifying clusters of arbitrary shapes and densities, and also in handling imbalanced datasets.

The two clustering algorithms proposed in this study can identify clusters with arbitrary densities and shapes in datasets with complex distributions. Both have a time complexity of O(n log n).
Keywords/Search Tags: Clustering, Local Center Measure, Local Density, Cluster with Arbitrary Shape and Density, Tripartite Strategy
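
The abstract relies on three neighborhood notions: k nearest neighbors, inverse k nearest neighbors, and mutual k nearest neighbors, followed by a three-way split of the points. The sketch below illustrates those notions only and is not the thesis's implementation. The function names, the use of the raw inverse-neighbor count as the density, and the `high`/`low` thresholds are all assumptions made for illustration, since the abstract does not state how the regions are delimited. The neighbor search is brute force, which costs O(n²) and therefore does not reproduce the O(n log n) bound reported above.

```python
import math


def knn_lists(points, k):
    """For each point, return the indices of its k nearest neighbors (brute force)."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(others[:k])
    return neighbors


def inverse_knn_counts(neighbors):
    """Count how many points list each point among their k nearest neighbors."""
    counts = [0] * len(neighbors)
    for nbrs in neighbors:
        for j in nbrs:
            counts[j] += 1
    return counts


def mutual_knn(neighbors):
    """For each point i, return the set of j where i and j list each other."""
    sets = [set(n) for n in neighbors]
    return [{j for j in sets[i] if i in sets[j]} for i in range(len(sets))]


def three_way_split(counts, high, low):
    """Hypothetical thresholding of the counts into three disjoint regions."""
    labels = []
    for c in counts:
        if c >= high:
            labels.append("positive")
        elif c <= low:
            labels.append("negative")
        else:
            labels.append("boundary")
    return labels


# Four points forming a small square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 12)]
nbrs = knn_lists(points, k=2)
counts = inverse_knn_counts(nbrs)
mutual = mutual_knn(nbrs)
labels = three_way_split(counts, high=3, low=0)
print(counts, mutual, labels)
```

In this toy input the distant point is nobody's nearest neighbor, so its inverse count is zero and it has no mutual neighbors, which is the kind of signal a density defined on inverse neighbors can use to separate sparse points from dense ones.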