Research On Clustering Algorithm For Large-Scale High-Dimensional Data

Posted on:2022-10-22

Degree:Master

Type:Thesis

Country:China

Candidate:T F Li

Full Text:PDF

GTID:2518306353477044

Subject:Master of Engineering

Abstract/Summary:

Clustering groups unlabeled data according to similarity, so that highly similar data points fall into the same group. As data have grown in both scale and dimensionality, however, traditional clustering algorithms can no longer meet current clustering needs. To address the clustering of large-scale high-dimensional data, this thesis analyzes why traditional clustering algorithms fail on such data, proposes corresponding solutions, and thereby improves the efficiency of existing clustering algorithms on large-scale high-dimensional data. The main work is as follows:

(1) The traditional reservoir sampling algorithm is improved: the ideas of parallelism and sampling are combined to construct a tree-shaped iterative reservoir sampling algorithm that introduces parallelism, which is used to process large-scale data.
(2) A feature selection algorithm based on gridding and subspace similarity is proposed, together with a missing-information compensation formula, so that the dimensionality of high-dimensional data is reduced while more information from the original data is retained.
(3) Building on the proposed methods for large-scale and high-dimensional data, an adaptive clustering algorithm suitable for large-scale high-dimensional data is constructed. The algorithm consists of two parts. First, the concept of data resolution is proposed and combined with the entropy method to determine the initial cluster centers; introducing data resolution accelerates this determination. Second, a clustering algorithm that combines cluster-center-drift k-means with multi-seed DBSCAN is constructed to perform the final clustering on the processed data.

The proposed methods are verified on data sets selected from the UCI Machine Learning Repository, and the experimental results confirm their effectiveness.
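
As a point of reference for item (1), the traditional reservoir sampling algorithm that the thesis sets out to improve can be sketched roughly as below. This is only the classic single-pass baseline (often called Algorithm R), not the tree-shaped iterative parallel variant proposed in the thesis, whose details are not given in the abstract. The function name `reservoir_sample` and its parameters are illustrative assumptions.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Draw k items uniformly at random from a stream of unknown length.

    Classic baseline only; hypothetical helper, not the thesis's algorithm.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item is kept with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

The sample needs only one pass and O(k) memory, which is why sampling of this kind is attractive as a preprocessing step for large-scale data. A call such as `reservoir_sample(range(1_000_000), 100)` returns 100 distinct values from the range.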

Keywords/Search Tags:

Tree-shaped iterative, Gridding, Subspace similarity, Data resolution, Entropy method, Cluster center drift

Related items

1	Research On Subspace Cluster Algorithms On Similarity And DBSCAN
2	Research On Cluster Tree Method For Textual Stream Classification
3	Applications Of Statistical Learning To Image Super-resolution And Tree-shaped Data
4	Studies On Clustering Algorithms For Categorical Data
5	Optimization Of Air Distribution And Thermal Environment In Data Center By U-shaped Channel And Auxiliary Fan
6	A High Dimensional Data Stream Clustering Algorithm Of Quick Dimension Reduction
7	Video Super-resolution Method Research Based On Similarity Constraints
8	Finding natural clusters through entropy minimization
9	Research On Data Stream Classification Method Based On Concept Drift Detection
10	Research On Algorithms For Subspace Clustering And Outlier Mining Based-on Information-entropy