As data grow large-scale and complex, current clustering models cannot aggregate them effectively, and ensemble clustering models for large-scale, high-dimensional data have therefore been proposed. In recent years the spectral clustering algorithm has become the basic model for large-scale ensemble clustering: it can effectively mine the cluster structure of large-scale data and apply that structure in medicine, biology, meteorology, and other fields, promoting the development of domain knowledge. As data become increasingly high-dimensional, high-dimensional datasets have likewise entered fields such as medicine and biology. Because of the complexity of high-dimensional structure and the resulting incompleteness of the mined information, subspace mining models have been proposed to improve both the efficiency of cluster-structure mining and the completeness of the mined information. The study of large-scale, high-dimensional datasets shows that they contain diverse and useful information that needs to be mined, and that the basic ensemble clustering models for either large-scale or high-dimensional data alone cannot aggregate such datasets effectively; a linear coupling method is therefore proposed to couple the basic models into a unified ensemble clustering framework for large-scale, high-dimensional datasets. The mined information can be applied to medical data analysis, image detection, pattern recognition, etc., promoting the development of those domains.

The exponential growth of data causes two problems: 1) the data size exceeds the memory threshold of the spectral clustering algorithm, causing memory overflow; 2) similarity values computed directly between data samples do not fully reflect the original dataset. To address these problems, an ensemble clustering model based on Natural Neighbor coding is proposed. First, a Natural Neighbor random mixing strategy is introduced to reduce the original sample volume and the memory cost. Then the set of sample exemplars and the sets of natural neighbors each exemplar represents are searched, the relation between the exemplars and the original samples is encoded, and a sparse submatrix is constructed from this coding relation. Next, the sparse submatrices are abstracted as graph models, and graph partitioning yields a single base clustering result. Finally, clusters are constructed by merging multiple base clustering results, and sample-cluster similarity matrices are built from these clusters. A transfer graph-partition theorem is introduced to prove that the cluster similarity matrix and the transfer similarity matrix are equivalent to the same graph model, and partitioning this equivalent graph yields the consensus clustering result. Because exemplars are selected at random, the model maps diverse coding relations; the multiple base clusterings generated from these relations improve the consensus clustering effect.

High-dimensional cluster structure is hidden in low-dimensional subspaces, and two problems arise when mining subspace clusters: 1) the subspaces share a uniform evaluation system, which homogenizes the similarity computation between data; 2) the information obtained from the multiple base clusterings is highly diverse. To address these problems, we propose a natural-neighbor Gaussian-kernel metric and entropy-weighting model. First, we construct a natural-neighbor-based Gaussian kernel whose parameters are updated to generate diverse evaluation systems, match the multiple subspaces with these evaluation systems to form a space of subspace-evaluation-system pairs, and obtain base clustering results in this space. Second, multiple base clusterings are merged into clusters, and entropy is introduced to calculate
the entropy value of each cluster; the clusters with large entropy values are selected to construct a bipartite graph model. Finally, the consensus clustering result is obtained with the normalized graph-cut algorithm. Taking a pool of Gaussian functions and random subspaces, the model maps a diverse structure of subspace-evaluation-system pairs, obtains diverse base clusterings from that structure, screens high-quality base clusterings by entropy weighting, and on this basis improves the consensus clustering result.

Large-scale, high-dimensional datasets have both a large data volume and a high data dimension, and their aggregation faces two problems: 1) the large volume and high dimension cause memory overflow in the model; 2) information is lost when the scale (volume and dimension) is reduced. To address these problems, a graph-partitioning ensemble clustering model for large-scale, high-dimensional data is proposed. First, the original high-dimensional space is randomly sampled to form multiple random subspaces, which reduces information loss. Second, within each subspace the Natural Neighbor random mixing strategy reduces the data volume; sparse submatrices are constructed, and graph models are generated to obtain base clustering results. Finally, clusters are constructed by merging multiple base clustering results, the clusters are mapped to graph models, and partitioning these graphs yields the consensus clustering result. Built on the natural-neighbor random mixing strategy and the subspace evaluation structure, the model maps diversified base clusterings and thereby improves its consensus clustering effect.

The studies of the above three models show that the proposed models aggregate data better than the comparison models, which reflects the efficiency of bipartite graph partitioning and the advantage of randomly mapping diversified systems. By reflecting diverse, high-quality base
clusterings, better clusters can be constructed and better consensus clustering results obtained.
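The exemplar-coding and bipartite-partition steps of the first model can be illustrated with a minimal Python sketch. All names, parameters, and simplifications here are hypothetical, not the thesis implementation: plain random sampling stands in for the Natural Neighbor mixing strategy, each sample is encoded by its nearest exemplars in a sparse sample-exemplar matrix, and the bipartite graph that matrix defines is partitioned spectrally.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Minimal Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    C = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    return labels

def exemplar_bipartite_cluster(X, n_exemplars, k_clusters, n_neighbors=3,
                               sigma=1.0, seed=0):
    """Sketch of the exemplar-coding idea: sample exemplars at random,
    encode each sample by its nearest exemplars in a sparse sample-exemplar
    matrix B, then spectrally partition the bipartite graph defined by B."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    ex_idx = rng.choice(n, size=n_exemplars, replace=False)  # random exemplars
    E = X[ex_idx]
    D = ((X[:, None] - E[None]) ** 2).sum(-1)   # sample-exemplar squared distances
    B = np.zeros((n, n_exemplars))
    nn = np.argsort(D, axis=1)[:, :n_neighbors]  # keep only the nearest exemplars
    for i in range(n):
        B[i, nn[i]] = np.exp(-D[i, nn[i]] / (2 * sigma ** 2))
    # normalised bipartite graph: SVD of D_r^{-1/2} B D_c^{-1/2}
    dr = B.sum(1)
    dc = np.maximum(B.sum(0), 1e-12)
    Bn = (B / np.sqrt(dr)[:, None]) / np.sqrt(dc)[None]
    U, _, _ = np.linalg.svd(Bn, full_matrices=False)
    emb = U[:, :k_clusters]                      # spectral embedding of the samples
    emb /= np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)
    return kmeans(emb, k_clusters)
```

Because only an n-by-p sample-exemplar matrix is ever formed (never the full n-by-n similarity matrix), the memory cost scales with the number of exemplars, which is the point of the reduction step described above.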
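The entropy-weighting step of the second model, scoring each merged cluster by the entropy of its overlap with the base clusterings and then filtering on that score, can be sketched as follows. The function names and the threshold-based filter are illustrative assumptions, not the thesis code.

```python
import numpy as np

def cluster_entropy(cluster, partitions):
    """Average entropy (in bits) of a cluster's overlap distribution
    across a set of base partitions (each a label array over all samples)."""
    idx = np.asarray(cluster)
    hs = []
    for labels in partitions:
        sub = np.asarray(labels)[idx]
        _, counts = np.unique(sub, return_counts=True)
        p = counts / counts.sum()
        hs.append(-(p * np.log2(p)).sum())
    return float(np.mean(hs))

def select_clusters(clusters, partitions, threshold):
    """Keep clusters whose average entropy meets the threshold
    (the abstract selects the clusters with large entropy values)."""
    return [c for c in clusters if cluster_entropy(c, partitions) >= threshold]
```

A cluster reproduced intact by the other base clusterings scores 0 bits, while a cluster split evenly in two scores 1 bit, so the score directly measures how the base clusterings disagree about a cluster.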
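For the third model, the combination of random subspaces and graph-based consensus can be approximated by the sketch below. It substitutes simple stand-ins for the thesis machinery: k-means per subspace instead of the sparse-submatrix base clusterer, and a thresholded co-association graph with connected components instead of the partitioned graph model; the 0.5 agreement threshold is likewise an assumption for illustration.

```python
import numpy as np

def kmeans(X, k, iters=30):
    """Minimal Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    C = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(0)
    return labels

def subspace_consensus(X, k, n_subspaces=8, subspace_dim=3, seed=0):
    """Random-subspace ensemble: cluster several random feature subsets,
    average pairwise agreement into a co-association matrix, then cut the
    matrix at 0.5 and read off connected components as consensus clusters."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    co = np.zeros((n, n))
    for _ in range(n_subspaces):
        dims = rng.choice(d, size=subspace_dim, replace=False)  # random subspace
        lab = kmeans(X[:, dims], k)                             # base clustering
        co += (lab[:, None] == lab[None, :]).astype(float)
    co /= n_subspaces
    adj = co >= 0.5                  # agreement graph over the samples
    labels = -np.ones(n, dtype=int)  # connected components via DFS
    cur = 0
    for i in range(n):
        if labels[i] < 0:
            stack = [i]
            labels[i] = cur
            while stack:
                u = stack.pop()
                for v in np.nonzero(adj[u])[0]:
                    if labels[v] < 0:
                        labels[v] = cur
                        stack.append(v)
            cur += 1
    return labels
```

Sampling features rather than discarding them is what limits the information loss the abstract mentions: every dimension has a chance of appearing in some subspace, while each base clustering still runs in a low-dimensional space.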