
High-dimensional Data-oriented Clustering Algorithm Design And Tensor Low-rank Representation Research

Posted on: 2021-03-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L L Zhuo
GTID: 1488306122979209
Subject: Computer Science and Technology
Abstract/Summary:
Clustering algorithms are widely used for exploratory data visualization and for discovering the underlying distribution of data. Industrial production generates large amounts of unlabeled data, which makes clustering algorithms highly significant for data processing. When detecting clusters with irregular shapes, current clustering algorithms usually either require more parameters or lose clustering accuracy. Clustering algorithms based on density peaks use a decision graph to find the density peaks and thereby reveal the underlying structure of the data. They achieve a good compromise between the number of parameters and clustering accuracy while efficiently identifying irregular clusters. However, when clusters in the data set are unevenly distributed internally, density-peak-based algorithms may fail to determine the correct center points, resulting in incorrect clustering. Real Internet data also tends to have high feature dimensionality and sparse samples, so the true cluster structure usually lies in a low-rank subspace of the samples. In addition, the diversity of prior probability distributions and the incremental updating of data complicate the analysis of low-rank tensor representations. In view of these problems, this thesis analyzes clustering algorithms based on density peaks, studies efficient and accurate clustering algorithms for recovering the true cluster distribution of data, and applies them to high-dimensional data clustering. It also studies a tensor decomposition method that adapts to different data distributions and to online learning, and that can be extended to a single GPU and to multiple GPUs. The main research work of the thesis is as follows:

To address the uneven distribution within clusters, this study proposes a density peak algorithm based on a hierarchical strategy. The algorithm consists of two stages: forming sub-clusters and merging sub-clusters. First, when forming sub-clusters, multiple data points with relatively large density and distance are selected as candidate center points. This sidesteps the difficulty of selecting center points that meet the requirements and avoids the misclassification caused by selecting a wrong center point. Second, when merging sub-clusters, an improved measure that simultaneously captures the connectivity and similarity between adjacent sub-clusters is proposed to reduce the difficulty of parameter setting. Comparative experiments on several commonly used UCI data sets show that the proposed algorithm solves the problem of uneven distribution within clusters.

To address the uneven distribution within clusters in high-dimensional data sets, this thesis proposes an improved subspace clustering algorithm based on multiple views and low-rank tensor representation. First, it handles noise and data corruption in the multi-view features of high-dimensional data by minimizing the ℓ2,1 norm of the error matrix. Second, it stacks the relevant multi-view data into a tensor and transforms the optimization problem of multi-view subspace representation into a low-rank tensor representation problem, so that the complementarity between views is fully taken into account. Based on these two steps, a more accurate subspace representation is obtained from the high-dimensional data set and then integrated into the final similarity matrix. Finally, the uneven distribution within clusters in high-dimensional data is handled by combining this similarity matrix with the HCFS algorithm. Comparative experiments against several typical subspace clustering algorithms on multiple multi-view face data sets demonstrate the effectiveness of the proposed algorithm.

Data distributions are diverse, and the update rule for the factor matrices has to be derived separately for each distribution, which complicates sparse non-negative tensor decomposition. To address this, the thesis proposes a general factor matrix update rule. First, based on the single-thread model, it designs an element-by-element update strategy that matches the sparsity of the tensor and avoids generating large intermediate matrices. Second, by setting an adaptive training step size, it guarantees the monotonicity of the factor matrix loss function and the non-negativity of the factor matrix elements. Finally, it analyzes and derives the factor matrix update rules under different data distributions and unifies them into a general update rule that adapts to a variety of distributions. In addition, by decomposing the update of the whole factor matrix into updates of independent rows, a degree of parallelism is achieved. Comparison experiments with other sparse non-negative tensor decomposition models on several real sparse tensor data sets demonstrate the algorithm's efficient convergence, its accuracy, and its applicability to different data distributions.

As data volumes keep growing, a single GPU can no longer load and process an entire data set. In addition, with the rapid development of the Internet, data is updated ever faster; ignoring real-time data may lose a large amount of information, while reprocessing the entire data set wastes resources. For the first problem, the thesis studies the communication principles between multiple GPUs and proposes a multi-GPU factor matrix update rule. Combined with the element-by-element update strategy, this rule removes the large computation and storage overhead caused by temporary matrices in existing parallelization and optimization algorithms. For the second problem, based on online learning strategies, the thesis proposes a factor matrix update rule for real-time data. It also improves the storage structure of the CSF tree and proposes a method for merging old data with new data. Sparse non-negative tensor decomposition experiments on 1, 4, and 8 GPUs over multiple high-order data sets demonstrate the effectiveness and scalability of the multi-GPU algorithm, and further experiments show that the online learning algorithm reduces the consumption of computing resources and storage space without losing real-time data information.
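The decision graph mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly described density-peak formulation (a cutoff-kernel local density and the distance to the nearest denser point); the abstract does not state the exact definitions used in the thesis, and the names `decision_graph` and `d_c` are hypothetical.

```python
import numpy as np

def decision_graph(points, d_c):
    """Return (rho, delta) per point; both definitions are assumptions.

    rho[i]   = number of other points closer than the cutoff d_c
    delta[i] = distance to the nearest point with higher rho, or the
               largest distance from i if no denser point exists
    """
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances, shape (n, n).
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Cutoff-kernel density; subtract 1 so a point does not count itself.
    rho = (dist < d_c).sum(axis=1) - 1
    delta = np.empty(len(points))
    for i in range(len(points)):
        denser = rho > rho[i]
        delta[i] = dist[i, denser].min() if denser.any() else dist[i].max()
    return rho, delta

# Two small, well-separated groups of four points each.
pts = [[0, 0], [0, 1], [1, 0], [0.4, 0.4],
       [10, 10], [10, 11], [11, 10], [10.4, 10.4]]
rho, delta = decision_graph(pts, d_c=0.8)
# Points with both large rho and large delta are candidate centers.
centers = np.argsort(rho * delta)[-2:]
```

In this toy data the two candidate centers are the interior points of each group. When a cluster is internally uneven, several points can score highly at once, which is the failure case the hierarchical sub-cluster strategy in the abstract is meant to handle.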
Keywords/Search Tags: Density Peak Clustering, Uneven Distribution within the Cluster, Multiview, High-dimensional Sparse Data, Single-Thread Model, Generalized Factor Matrix Update Rules, Online Learning