Application And Research On Clustering Algorithm In Large Scale High Dimensional Datasets

Posted on:2016-10-19

Degree:Master

Type:Thesis

Country:China

Candidate:J Deng

GTID:2308330464964984

Subject:Computer Science and Technology

Abstract/Summary:

In recent years, with the rapid development of cloud computing, the Internet of Things, and social networks, the volume of data accumulated in every domain has grown rapidly. This mass of data potentially contains a great deal of useful information, so how to collect and analyze it effectively in order to extract that information has become both a research hotspot and a challenge. As an important unsupervised learning method in data mining, cluster analysis has been widely used in education, scientific research, the Internet, and other fields. Although existing clustering algorithms achieve high clustering quality on small-scale, low-dimensional data, their clustering validity can drop on large-scale, high-dimensional data. Finding an approach to clustering large-scale, high-dimensional data has therefore become a key and difficult problem. This thesis takes data scale reduction as its basis and studies sampling techniques in depth. The main work of the thesis is summarized as follows:

(1) Sampling has been widely used in cluster analysis of large-scale data. To overcome the low sample quality that results from setting the sample size manually, the thesis proposes a sampling clustering algorithm with a self-adaptive sample size, built on an existing statistical optimal sample size algorithm. The improved algorithm adds a step that removes redundant features from high-dimensional data, so it can handle large-scale, high-dimensional data effectively. Extensive experiments on UCI datasets show that the sample sets obtained by the improved algorithm, with the sample size it determines, are of higher quality.

(2) Although the existing variable grid division density biased sampling clustering algorithm handles large-scale data efficiently, it becomes inefficient on high-dimensional data because it has to process every dimension. To address this, the thesis proposes an efficient density biased sampling clustering algorithm. First, the importance of different features of high-dimensional data in the cluster space is studied, and an efficient feature selection method for high-dimensional data is designed. Second, this method is combined with the variable grid density biased sampling clustering algorithm, so that the improved algorithm can handle large-scale, high-dimensional data effectively. Extensive experiments on artificial and UCI datasets demonstrate the effectiveness of the proposed algorithm.

(3) To further show the practical applicability of the proposed efficient density biased sampling algorithm, the algorithm is added to Weka, an open-source data mining platform, and compared with the other sampling algorithms in Weka on a real large-scale, high-dimensional dataset. The experimental results show that the proposed algorithm obtains higher sample quality and achieves higher clustering performance on the sample set. As a consequence, cluster analysis of large-scale, high-dimensional datasets can be carried out effectively.
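
The abstract does not give the details of the proposed algorithms, so the following is only a minimal sketch of generic density biased sampling on a grid, not the thesis's method. It assumes a fixed, equal-width grid (not the variable grid division the thesis refers to) and performs no feature selection. The names `grid_cell` and `density_biased_sample` and the parameters `bins`, `e`, and `target_size` are hypothetical, chosen here for illustration. A point in a cell holding n points is kept with probability proportional to n^(-e), so dense regions are thinned more aggressively than sparse ones.

```python
import random
from collections import Counter


def grid_cell(point, lows, highs, bins):
    """Map a point to the tuple of its per-feature bin indices."""
    cell = []
    for x, lo, hi in zip(point, lows, highs):
        width = (hi - lo) or 1.0  # a constant feature always falls into bin 0
        cell.append(min(int((x - lo) / width * bins), bins - 1))
    return tuple(cell)


def density_biased_sample(points, target_size, bins=8, e=0.5, seed=0):
    """Return the indices of a sample that under-represents dense grid cells.

    A point in a cell holding n points is kept with probability
    min(1, alpha * n ** -e). alpha is chosen so that the expected sample
    size is about target_size (it is smaller when the cap at 1 is hit).
    e = 0 gives uniform sampling; e = 1 draws roughly the same number of
    points from every occupied cell.
    """
    rng = random.Random(seed)
    # Per-feature ranges define an equal-width grid (an assumption of this sketch).
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    cells = [grid_cell(p, lows, highs, bins) for p in points]
    counts = Counter(cells)  # number of points in each occupied cell
    alpha = target_size / sum(n ** (1.0 - e) for n in counts.values())
    return [i for i, c in enumerate(cells)
            if rng.random() < min(1.0, alpha * counts[c] ** -e)]


if __name__ == "__main__":
    gen = random.Random(42)
    # Synthetic data: one dense cluster (900 points) and one sparse cluster (100 points).
    dense = [(gen.uniform(0.0, 0.1), gen.uniform(0.0, 0.1)) for _ in range(900)]
    sparse = [(gen.uniform(0.9, 1.0), gen.uniform(0.9, 1.0)) for _ in range(100)]
    data = dense + sparse
    picked = density_biased_sample(data, target_size=100, bins=2, e=1.0)
    from_sparse = sum(1 for i in picked if i >= 900)
    print(len(picked), "points sampled,", from_sparse, "from the sparse cluster")
```

Because every grid cell is indexed over all features, the number of possible cells grows exponentially with the dimension. This is the cost that, according to item (2), the thesis reduces by selecting features before sampling.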

Keywords/Search Tags:

clustering analysis, sampling, variable grid division, feature selection, large scale high dimensional dataset

Related items

1	Application And Research On Clustering Algorithm In Large Scale Datasets
2	Estimation Of The Number Of Clusters On High-dimensional Large-scale Dataset
3	Variable Selection For Gaussian Mixture Model-Based Clustering And Its Application
4	Statistical Analysis Of High-dimensional Data Based On Feature Selection
5	Application Of Grid And Density Based Clustering Algorithm In Data Mining
6	Feature Selection And Clustering For High-dimensional Data
7	Clustering Feature Tree For Large-Scale Support Vector Machines
8	A Two-stage Hybrid Ant Colony Optimization Algorithm For High-dimensional Feature Selection
9	Research On Feature Selection Algorithm In Data With Large Scale And High Dimension Based On Evolutionary Multi-Objective Optimization
10	Research On Large-scale Regularized Machine Learning Algorithms