Cluster analysis is an important data mining technique. Its goal is to find clusters in data such that data points in the same cluster are more similar to each other than to data points in different clusters. Researchers have proposed many clustering algorithms and applied them to image segmentation, information retrieval, data compression and bioinformatics. In recent years, with the rapid development of emerging technologies such as big data, blockchain and artificial intelligence, a large amount of complex structured data, such as data with irregular shapes and data with uneven density, has accumulated in the Internet, scientific research and industrial production. Such data pose serious challenges to traditional clustering algorithms, so clustering complex structured data has become a challenging research topic. This dissertation studies clustering algorithms for complex structured data. The main research contents and results are as follows:

(1) For data with uneven density, this dissertation proposes a clustering algorithm based on local gap density (LGD). The algorithm defines the local gap density of a data point according to the gap between the density of the data point and the highest density among its neighbors. Based on the local gap density, LGD first identifies the core points and border points in the data. Then, it defines latent cross-cluster edges according to the weights of the edges in the k-NN graph and whether their endpoints are border points. After the cross-cluster edges are deleted from the k-NN graph, all data points on the branch containing more data points are taken as an initial cluster. Finally, for the unclustered data points, it selects representative points in the initial clusters and assigns each unclustered data point to the initial cluster that contains its representative point. Experiments show that the proposed algorithm performs better than the
traditional and recent clustering algorithms.

(2) For complex structured data, this dissertation proposes a clustering algorithm based on the density decreased chain (DDC). DDC defines a density decreased chain on the mutual k-NN graph: the density of the data points on the chain decreases from one point to the next, and the chain starts at the data point with the locally highest density. Using density decreased chains, data with complex structure can be divided well into core points and border points. To cluster the data, DDC first defines the concept of an intra-cluster density decreased chain to find initial clusters, and then hierarchically assigns the data points that do not belong to any initial cluster to the corresponding initial clusters based on the density decreased chains. The experimental results show that DDC performs better than related clustering algorithms on complex structured data.

(3) For large-scale complex structured data, this dissertation proposes a clustering algorithm based on junction density (JDC). The algorithm defines the junction density to measure the density of the junction region between two subclusters, where the subclusters are obtained by partitioning the data with the K-means algorithm. Based on the junction density between two subclusters, JDC first redefines the concept of density reachability from DBSCAN and takes the subclusters that satisfy the density reachability condition as the initial clusters. Then, it redefines the concept of representative points from LGD in order to assign the remaining subclusters to the corresponding initial clusters. Because JDC merges subclusters rather than clustering data points directly, its computational complexity is significantly lower than that of density-based algorithms that cluster data points directly. The efficiency and effectiveness of the proposed algorithm are verified on several complex structured datasets.

(4) For complex structured data with noise, this dissertation proposes a
clustering algorithm named diffusion clustering (DC). The algorithm first defines the diffusion distance of a data point according to the distances between the data point and its neighbors. Based on the diffusion distance of each data point and the average diffusion distance of its neighbors, it then divides the data into diffusible points and terminal points. Finally, DC defines the concept of the diffusion set of diffusible points and uses it to find clusters in the data. In particular, terminal points far from diffusible points are identified as noise. Experiments show that the proposed algorithm finds clusters in complex structured data well and accurately identifies noise in the data.

To address the problem of clustering complex structured data, this dissertation conducts a systematic study and defines several new concepts, including local gap density, density decreased chain, junction density and diffusible point. Based on these concepts, it proposes four new clustering algorithms for complex structured data, enriching the research on cluster analysis.
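
To make the diffusible/terminal point idea behind DC more concrete, the following is a minimal Python sketch. It is not the dissertation's implementation: the abstract does not give the exact definitions, so everything concrete here is an assumption made for illustration. In particular, the diffusion distance is assumed to be the mean distance to the k nearest neighbors, a point is assumed to be diffusible when its diffusion distance does not exceed the average diffusion distance of its neighbors, clusters are assumed to be connected groups of diffusible points in the k-NN graph, and the `noise_factor` threshold and the function name `diffusion_clustering_sketch` are hypothetical.

```python
import math


def knn(points, k):
    """Brute-force k-NN: for each point, a list of (distance, index), nearest first."""
    table = []
    for i, p in enumerate(points):
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        table.append(ranked[:k])
    return table


def diffusion_clustering_sketch(points, k=2, noise_factor=2.0):
    """Return one label per point; -1 marks noise. All rules below are assumptions."""
    n = len(points)
    nbrs = knn(points, k)

    # Assumption: diffusion distance = mean distance to the k nearest neighbors.
    diff = [sum(d for d, _ in nb) / k for nb in nbrs]

    # Assumption: a point is diffusible if its diffusion distance does not
    # exceed the average diffusion distance of its neighbors; otherwise terminal.
    diffusible = [
        diff[i] <= sum(diff[j] for _, j in nb) / k for i, nb in enumerate(nbrs)
    ]

    # Assumption: a cluster is a connected group of diffusible points, where
    # k-NN edges are treated as undirected (union-find with path halving).
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        if diffusible[i]:
            for _, j in nbrs[i]:
                if diffusible[j]:
                    parent[find(i)] = find(j)

    labels = [-1] * n
    ids = {}
    for i in range(n):
        if diffusible[i]:
            labels[i] = ids.setdefault(find(i), len(ids))

    # Terminal points join the cluster of their nearest diffusible neighbor
    # if it is close enough (hypothetical threshold); otherwise they stay noise.
    for i in range(n):
        if not diffusible[i]:
            for d, j in nbrs[i]:
                if diffusible[j]:
                    if d <= noise_factor * diff[j]:
                        labels[i] = labels[j]
                    break
    return labels


# Two unit squares, one border point next to the first square, one far outlier.
points = [
    (0, 0), (1, 0), (0, 1), (1, 1), (2.5, 0.5),
    (10, 10), (11, 10), (10, 11), (11, 11), (5, 20),
]
labels = diffusion_clustering_sketch(points, k=2)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, -1]
```

In this toy example the border point at (2.5, 0.5) is terminal but close to diffusible points, so it joins the first cluster, while the isolated point at (5, 20) is terminal and far from every diffusible point, so it is labeled as noise. The brute-force neighbor search is quadratic in the number of points and is only meant for small illustrations.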