Research On Asymptotic Theory And Methods For Clustering Ultra-high Dimensional Data

Posted on: 2024-04-02
Degree: Master
Type: Thesis
Country: China
Candidate: W Lv
Full Text: PDF
GTID: 2568307115963879
Subject: Computer Science and Technology
Abstract/Summary:
With the continuous development of the Internet and big data, human society generates enormous amounts of data every day. These data contain a huge amount of information and unimaginable value. Unsupervised learning, which does not require large human and financial investment to label data, has therefore received more and more attention from all fields of society. As an important research direction in unsupervised learning, cluster analysis has also developed rapidly, while at the same time encountering some difficult problems. Clustering high-dimensional data such as images, text, and time series, and even ultra-high-dimensional data such as gene sequences and protein sequences, has become a major difficulty in the clustering field. The curse of dimensionality leads to a dramatic degradation of clustering performance on high-dimensional and even ultra-high-dimensional data. It mainly manifests in four aspects: data sparsity, distance convergence, complex and variable feature relationships, and time and space resource consumption. Since clustering is defined as grouping similar data into one category and dissimilar data into different categories, and in a vector space similarity is usually expressed as a distance metric, the high-dimensional distance-convergence phenomenon has a particularly large impact on high-dimensional clustering. In addition, sample data that appear extremely sparse over the wide range of a high-dimensional space strongly affect current neural network models based on statistical learning.

To address the curse of dimensionality faced by high-dimensional tasks, the main research of this thesis is as follows.

(1) Gain insight into the asymptotic theory of high-dimensional data, which investigates the mathematical nature of the curse of dimensionality, and experimentally verify the theory and its applicability to the real world. The existing asymptotic theory of high-dimensional data no longer holds that the distance-convergence phenomenon has only a negative impact on high-dimensional clustering; it asserts that samples from different distributions (with different distribution means and variances) converge to different distance values in the high-dimensional case. Distances between samples from different distributions then become easier to distinguish in high dimensions because of the difference in their convergence values, hence the phenomenon is called the dimensional blessing. Experiments show that the dimensional blessing does exist for a single ideal distribution or a mixture of distributions with little overlap, but the converged distance values of samples from different distributions may also coincide, which greatly harms the clustering of high-dimensional data. Especially on real-world data sets, where the data distributions are complex and the convergence values are mixed together, the theory struggles to play a satisfactory role.
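The distance-convergence and dimensional-blessing behaviour discussed in (1) can be illustrated with a small simulation. The following is a minimal NumPy sketch, not the experimental protocol of the thesis: two isotropic Gaussians with arbitrarily chosen means and variances stand in for "samples from different distributions", and pairwise Euclidean distances are normalised by the square root of the dimension so that both their concentration and their differing limits are visible.

```python
# Toy illustration of distance concentration and the "dimensional blessing".
# The Gaussian parameters and sample sizes below are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 200

for dim in (10, 100, 1000, 10000):
    # Cluster A: mean 0.0, std 1.0; cluster B: mean 0.5, std 1.5 (illustrative).
    a = rng.normal(0.0, 1.0, size=(n, dim))
    b = rng.normal(0.5, 1.5, size=(n, dim))

    # Normalised Euclidean distances (divide by sqrt(dim) to keep scales comparable).
    d_aa = np.linalg.norm(a[: n // 2] - a[n // 2:], axis=1) / np.sqrt(dim)
    d_bb = np.linalg.norm(b[: n // 2] - b[n // 2:], axis=1) / np.sqrt(dim)
    d_ab = np.linalg.norm(a - b, axis=1) / np.sqrt(dim)

    print(f"dim={dim:>6}  "
          f"A-A {d_aa.mean():.3f}±{d_aa.std():.3f}  "
          f"B-B {d_bb.mean():.3f}±{d_bb.std():.3f}  "
          f"A-B {d_ab.mean():.3f}±{d_ab.std():.3f}")
```

As the dimension grows, the spread of each group of distances shrinks (concentration) while their means settle at different values (the blessing). Making the two variances equal and the means nearly identical reproduces the failure case described above, where the converged values coincide and the clusters become hard to separate by distance.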
(2) To address the problems of complex feature relationships and distance convergence that harm clustering under the curse of dimensionality, a deep clustering method is proposed that learns to merge the relationships between features and uses dimensionality reduction to avoid the effect of distance convergence. The method introduces the idea of feature weighting on top of a deep autoencoder and a force-directed graph layout algorithm, so that the model can merge high-dimensional features into a small number of low-dimensional key features under the guidance of the reconstruction objective, highlight the cluster structure based on the nearest-neighbour structure of the data, and take into account how suitable the features are for the clustering task. Detailed experiments on multiple datasets demonstrate that the approach successfully bypasses the effects of the curse of dimensionality and greatly improves the clustering performance on high-dimensional data.
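To make the overall shape of such an approach concrete, the following is a minimal PyTorch sketch of autoencoder-based clustering with a learnable feature-weight vector; it is an illustration rather than the thesis' actual model. The class name, layer sizes, softmax weighting scheme, and the final k-means step are all assumptions, and the force-directed graph layout and nearest-neighbour components are omitted.

```python
# Minimal sketch: feature-weighted autoencoder + k-means on the latent codes.
# All names, sizes, and the weighting scheme are illustrative assumptions.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class WeightedAutoencoder(nn.Module):
    """Autoencoder with one learnable weight per input feature (illustrative)."""
    def __init__(self, in_dim: int, latent_dim: int = 10):
        super().__init__()
        self.feature_logits = nn.Parameter(torch.zeros(in_dim))
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, in_dim),
        )

    def forward(self, x):
        # Softmax keeps the weights positive and normalised; scaling by the
        # number of features keeps the average weight near 1 (a design choice).
        w = torch.softmax(self.feature_logits, dim=0) * x.shape[1]
        z = self.encoder(x * w)          # low-dimensional key features
        return z, self.decoder(z)

def cluster(x: torch.Tensor, n_clusters: int, epochs: int = 50):
    """Train on reconstruction, then run k-means in the latent space."""
    model = WeightedAutoencoder(x.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        _, recon = model(x)
        loss = nn.functional.mse_loss(recon, x)   # reconstruction objective
        loss.backward()
        opt.step()
    with torch.no_grad():
        z, _ = model(x)
    # Clustering in the reduced space side-steps high-dimensional distance concentration.
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(z.numpy())
```

Given a data matrix of shape (samples, features), `cluster(torch.tensor(data, dtype=torch.float32), n_clusters=10)` would return cluster labels computed in the low-dimensional latent space, where distance concentration is far less severe than in the original feature space.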
This thesis studies the curse of dimensionality in high-dimensional and even ultra-high-dimensional clustering from both theoretical and methodological perspectives. It points out the shortcomings of the current high-dimensional asymptotic theory and proposes a new clustering algorithm, providing a new direction for research on high-dimensional clustering.

Keywords/Search Tags: Ultra-high dimensional clustering, deep clustering, distance convergence, data dimensionality reduction