
Research On Clustering Algorithm For Large-Scale High-Dimensional Data

Posted on: 2022-10-22
Degree: Master
Type: Thesis
Country: China
Candidate: T F Li
Full Text: PDF
GTID: 2518306353477044
Subject: Master of Engineering
Abstract/Summary:
Clustering groups unlabeled data by similarity, so that highly similar samples fall into the same cluster. As both the scale and the dimensionality of data have grown, traditional clustering algorithms can no longer meet current clustering needs. To address the clustering of large-scale high-dimensional data, this thesis analyzes why traditional clustering algorithms fail on such data, proposes corresponding solutions based on those reasons, and improves the efficiency of existing clustering algorithms on large-scale high-dimensional data. The main work is as follows:

(1) The traditional reservoir sampling algorithm is improved. The ideas of parallelism and sampling are combined to construct a tree-shaped iterative reservoir sampling algorithm that introduces parallelism, and this algorithm is used to process large-scale data.

(2) A feature selection algorithm based on gridding and subspace similarity is proposed, together with a missing-information compensation formula, so that the dimensionality of high-dimensional data is reduced while more information from the original data is retained.

(3) Building on the proposed methods for processing large-scale and high-dimensional data, an adaptive clustering algorithm suitable for large-scale high-dimensional data is constructed. The algorithm consists of two parts. First, the concept of data resolution is proposed and combined with the entropy method to determine the initial cluster centers; introducing data resolution accelerates this determination process. Second, a clustering algorithm based on cluster-center-drift k-means combined with multi-seed DBSCAN is constructed to perform the final clustering on the processed data.

The proposed methods are verified on data sets selected from the UCI Machine Learning Repository. The experimental results confirm the effectiveness of the proposed methods.
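
For readers unfamiliar with the baseline named in item (1), the following is a minimal sketch of the traditional (sequential) reservoir sampling algorithm, which draws a uniform random sample of k items from a stream in a single pass. It is not the tree-shaped iterative parallel variant proposed in the thesis, whose details the abstract does not give. The function name `reservoir_sample` and its parameters are illustrative assumptions.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Classic single-pass reservoir sampling (hypothetical helper name).

    Keeps a uniform random sample of up to k items from a stream whose
    length need not be known in advance.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 10 rows out of 1,000,000 without materializing the full data set.
    print(reservoir_sample(range(1_000_000), 10, random.Random(0)))
```

Because each item is seen once and only k items are stored, memory use is independent of the stream length, which is why the technique suits large-scale data.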
Keywords/Search Tags: Tree-shaped iterative, Gridding, Subspace similarity, Data resolution, Entropy method, Cluster center drift